Using less memory to look up IP addresses in Mess With DNS

137 points by todsacerdoti 8 months ago

Another solution is to use an MMDB (“MaxMind DB”) file [1] which is essentially a binary tree + deduplicated values (same as idea 3.1).

There are several free ASN MMDBs [2,3] but you can also build your own MMDB files from any Prefix->Value mapping with the mmdbwriter library [4] or a CLI tool built on top of it like mmdbctl [5].

Assuming the ASN MMDB is fully loaded in memory, it would use around 60MB.

[1] https://maxmind.github.io/MaxMind-DB/

[2] https://dev.maxmind.com/geoip/docs/databases/asn/

[3] https://ipinfo.io/products/free-ip-data-downloads

[4] https://github.com/maxmind/mmdbwriter

[5] https://github.com/ipinfo/mmdbctl

(I work for IPinfo, but there are lots of other companies offering MMDB files).

seligman99 8 months ago

This is the same idea I used for my Cloud IP lookup tool [1], lets it all work in browser with a small file to search against
[1] https://cloud-ips.s3-us-west-2.amazonaws.com/index.html
- maxmouchet 8 months ago
  
  Oh this is nice, and a cool use of HTTP range requests!
  
  ignoramous 8 months ago
  
  Oh, then you'll love: Hosting SQLite databases on GitHub Pages or any static file hoster (2021), https://news.ycombinator.com/item?id=27016630
- zdyxry 8 months ago
  
  cool
gallypette 8 months ago

https://github.com/adulau/mmdb-server
mattpallissard 8 months ago

"MaxMouchet DB"

zamadatix 8 months ago

If you want to get fun with it you can actually fit the entire IP range in a single 32 bit value. On the internet networks larger than /24s aren't considered valid advertisements so you only need to store 3 bytes (the first 24 bits) of the address to hold the start. If you store than in a uint32 you can use the remaining 8 bits to hold the netmask (you don't actually need all of them but it keeps things byte aligned), which lets you dynamically calculate the end address in just a few CPU cycles. The search still occurs on the base address portion so you don't ever have to calculate the end address until you've already found the right base address.

riobard 8 months ago

> fit the entire IP range in a single 32 bit value
… only if you pretend IPv6 does not exist, which isn't the case in the original blogpost.
But the idea can be extended to fit entire IP (both v4 and v6) subnets in an uint64, given that the minimal IPv6 prefix size to be globally routed is /48, you could use the first byte of the uint64 to mark the address type (e.g. 0x4 for IPv4 and 0x6 for IPv6), the remaining 3 (for IPv4) or 6 (for IPv6) bytes for actual prefix, and the last byte for prefix length.
teddyh 8 months ago

> On the internet networks larger than /24s aren't considered valid advertisements
The number 24 is larger, but the network is smaller.
- zamadatix 8 months ago
  
  Good catch, thanks! This should have been "netmasks" instead of "networks" but it's too late for me to edit now.
  
  AStonesThrow 8 months ago
  
  A netmask is different and deprecated terminology.
  The "/24" you refer to is a "prefix length", i.e. a network prefix being advertised by BGP.
blibble 8 months ago

for ipv4 you can do with with just a single lookup, so real O(1)
store the entire thing in one big array, indexed by the uint24 value of the /24
so ASN for 0.0.0.0 stored in position 0, ASN for 0.0.1.0 stored in position 1, etc
and you can drop 0.0.0.0/8, 10.0.0.0/8, 127.0.0.0/8, 224.0.0.0/4, 240.0.0.0/24 to save 35 /8s
(256-16-16-3)256256*4 bytes needed -> 55mb
mmap it if necessary
- zamadatix 8 months ago
  
  If you do it this way I think you lose all of the prefix size info I.e. every match will claim to be part of a /24 match. Depends if you care showing that portion or not if that matters though.

tptacek 8 months ago

Hey, Julia! I know this isn't really material, but: `hallpass`, our SSH server, probably shouldn't be using 40MB. We had a bug in our `init` that we fixed recently: `hallpass` is a Go program, and we were passing user environment variables to it, so users who were passing Go memory settings were also passing settings through to `hallpass`.

If you're still seeing `hallpass` using that much memory, let me know! We'll figure out what's going on.

vlovich123 8 months ago

Since this is fairly static data, preprocess the on-disk format to match your memory layout and just mmap the file so that your memory usage for the DB is effectively 0 and the OS kernel pages it in/out for you based on system-level needs?

mstremer 8 months ago

Or get a ready-made solution like libloc that implements this already with a tree, deduplication and blazing-fast lookups.
See https://location.ipfire.org/
- 1a527dd5 8 months ago
  
  I feel like you should add a slight disclaimer that you are involved in the project in some capacity :)

hnenjoyer 8 months ago

Would be nice to see revisited SQLite solution.

Rest of the described approaches relies on data being in memory rather on disk - not sure whether it was microsd card, SSD over USB, etc. but there's in memory option for SQLite.

Having that in mind it's time to optimize storage - instead of text just store using two 64 bit integers and it was already noticed two indexes are not very helpful - just create one that has both columns included - preferable as primary key so you don't waste space and already have rest of the select data included in single lookup.

Happy to see revisited results, but still not sure if that's a best approach for your problem - but seems like obvious improvements for generic SQL approach.

zamadatix 8 months ago

> As an aside – I’m storing the ASN in a uint32, is that right? I looked in the ip2asn file and the biggest one seems to be 401307, though there are a few lines that say 4294901931 which is much bigger and would not fit in a uint32. I decided to just ignore the gigantic ASNs and assume I could fit everything in a uint32.

Does anyone have a theory to how that 4294967296 might have made its way into the data? I'm getting 0 matches on the actual internet with:

    birdc show route protocol peer4 | grep 4294967296 -c

And ASNs are certainly not >32 bit.

Edit: actually, I'm not seeing it in the linked iptoasn data either?

de46le 8 months ago

https://datatracker.ietf.org/doc/html/rfc6996#section-5 - it's in the range reserved for private use, so it's probably network internal to wherever the list is originally from?
maxmouchet 8 months ago

Bogus announcements are probably filtered by your upstream(s) (see [1] for a common list of filters).
IP-to-ASN mappings are typically built from route collectors [2,3] that peer with various networks and receive their announcements. AFAIK route collectors don't filter anything and it's easy to find bogus announcements (e.g. private ASNs) in the data.
I can't find 4294967296 from a quick glance at the latest RouteViews data but I can find other private ASNs. For example AS7594 - AS2764 - AS4294901866 for 210.10.189.0/24 seen by the route-views.perth collector.
I don't know what kind of filtering iptoasn.com is doing but at work (ipinfo.io) we do filter bogus origins, as well as a bunch of other things like RPKI/IRR-invalid routes and hyper-specific prefixes (> /24 or /48) [4].
[1] https://bgpfilterguide.nlnog.net
[2] https://www.routeviews.org/routeviews/
[3] https://www.ripe.net/analyse/internet-measurements/routing-i...
[4] https://hyperspecifics.io
- zamadatix 8 months ago
  
  Actually 4294967296 couldn't ever appear as the maximum value you can fit in the protocol field is 1 less than that... my problem here was I couldn't manage to keep the 2 numbers I was comparing (the one in the article and 2^32) straight haha! This was mistake was noted by a commenter here https://news.ycombinator.com/item?id=41963745
  That said you're ultimately right that my upstream provider is filtering the 4294901866 value from the article as well anyways for the reasons you stated.
  
  maxmouchet 8 months ago
  
  Ah right haha. Thanks for the heads up, I should have checked ^^
rayhaanj 8 months ago

Also 4_294_901_931 is actually just small enough to fit in a u32 whose max value is 4_294_967_295
- astrobe_ 8 months ago
  
  Perhaps hex notation helps to see things better: 0xFFFF00AB. It's actually nearly 64K below the maximum. 0xAB is 171.
- zamadatix 8 months ago
  
  Ha, this is exactly right! It's 100% in u32 range and part of the private use reservations so wouldn't appear in the internet table unless such values aren't filtered by the provider and/or it's being used locally.
  Originally I went to write your exact comment because it seemed the value in the article should fit at a glance and then I must have done the comparison check backwards because I started pasting the 2^32 value in the rest of my comment concluded it was actually too large when really I had just jumbled things about.
  Thanks for setting my mind straight!

tialaramex 8 months ago

The ip2asn-v4-u32.tsv file has

    996520704       996520959       4294901931      Unknown AS4294901931

    3523770368      3523770623      4294901931      Unknown AS4294901931

    3523771136      3523771391      4294901931      Unknown AS4294901931

moontear 8 months ago

Interesting read - but how cool is Mess with DNS! What a nice live learning tool.

immibis 8 months ago

You need an array of 64-bit values: a 32-bit end address, and a 32-bit ASN. Use 0 for unallocated ranges - it's an invalid ASN.

For IPv6 you need more. It just barely fits if you consider that all public Internet addresses start with 001 and only up to 48 bits can be a published prefix - that's 45 bits - and the public ASNs go up to 19 bits according to your article - that's exactly 64 in total (those high ASNs are bad data from wherever you're getting it from). But it won't work once the next 100000 ASNs get assigned, so you'd better go up to 96 or 128 bits and store the ASN properly in 32 bits.

For cache efficiency you should have one table of addresses and another parallel table of ASNs. Then every access to the address table that pulls in a cache line doesn't waste half the cache line with ASNs you aren't looking for. It won't affect the total memory use, however.

A separate table would hold the name and country for each ASN.

pjmlp 8 months ago

A good example on how to use better data structures, instead of the usual rewrite into something else.

ficklepickle 8 months ago

For even more savings: One netip.Prefix will use less memory than two netip.Addr

Better still, use the free geolite ASN MMDB with geoip2-golang[0]. Or the lower-level maxminddb-golang[1] if you only need certain fields.

0 - https://github.com/oschwald/geoip2-golang

1 - https://github.com/oschwald/maxminddb-golang

gmuslera 8 months ago

The trie approach seemed to be a bit plain. Tries are particularly good for storing netblocks, and see if a particular IP belongs to some block stored there. But is better to use binary values instead of strings, and to use patricia or radix trees as they might compress strings of bits without branches with a single node. And, anyway, postgresql already have the netblock type with efficient indexes for them.

But if that is used for individual IPs, without worrying about blocks they belong to, probably won't get big gains in that area.

ignoramous 8 months ago

For hierarchy like IP addresses, I've seen programs use specialized data structures like CritBit Tries [0] and Allotment Routing Tables [1].
[0] https://news.ycombinator.com/item?id=3015246
[1] https://github.com/openbsd/src/blob/2bd42e97200bee/sys/net/a...

Timber-6539 8 months ago

Had the same issue with restic taking up a lock in the event of an unsuccessful operation. I have a very simple workaround for it.

ExecStartPre=/usr/bin/restic unlock

ExecStart=/usr/bin/restic backup

LtdJorge 8 months ago

That has the benefit of Systemd not letting more than one instance run at the same time, which I'm guessing is one of the reasons the lock was needed in the first place.
You just have to make sure that the type of Service used is correct, so that Systemd can track whether Restic has actually stopped running.

wvh 8 months ago

SQL is not going to be of much help when the dataset easily fits into memory, but I'm a bit surprised by the lack of performance of the trie. I guess the overhead is too much when the dataset has a lot of single IP addresses (only one byte difference), but I imagine it would be helpful for sparse IPv6 addresses.

Go has an excellent standard library, but the solutions in there rarely compete with others writing a dedicated library to solve a hard problem they really had to solve.

1a527dd5 8 months ago

I wonder if DuckDB could do better than SQLite?

reincoder 8 months ago

Tried it. I think it is marginally better than SQLite, but it is not the best solution.
I work for IPinfo and was quite excited when they added support for the IP address data type. However, at the end of the day, the most efficient lookup mechanism has to be the MMDB database. Rather doing the enrichment of IP addresses inside of DuckDB, the better solution was doing an enrichment outside of DuckDB using the MMDB database, dump it on a csv and inserting the CSV in DuckDB as a table.
See the top comment of the thread for more context around MMDB.
- 1a527dd5 8 months ago
  
  Yeah I gave it a shot, it didn't have much over SQLite other than the file size was a lot smaller.
  Also, I love ipinfo excellent service! We use it all the time at work.
  
  reincoder 8 months ago
  
  Awesome. We’ve integrated with platforms like GCP BQ, Snowflake, Clickhouse, and Postgres etc. and still, MMDB stands out as the most performant, especially for enriching millions of IPv6 addresses.
  The best approach might be to enrich IP addresses via an MMDB database, then dump the results in temp table and query the results as you would any table. If you haven't tried the free db, you should as this illustrate this point pretty well: https://ipinfo.io/products/free-ip-database
  It’s great to hear that you’re using our service! Please feel free to reach out anytime if you have any queries/feedback. This thread alone contains comments from 2 IPinfo engineers and me!

alam2000 8 months ago

[dead]

OhNoNotAgain_99 8 months ago

[dead]