A very short history of Linux

2020-08-11

Originally posted in reply to How Linux could have be inspired by Unix if it was a closed source OS? on Reddit

AT&T licensed Unix, including source code, to several companies and universities. Some of those licensees wrote replacements for the AT&T components of the system and released them as free software. Eventually, enough components had been rewritten that a complete working system could run without any AT&T code. Minix was one such system. Linus was inspired by Minix but chose to develop a monolithic kernel, which became Linux.

Hosting a Minecraft Event on Google Cloud

2020-07-15

Originally posted in reply to Hosting a minecraft event [Compute engine question] on Reddit

For brevity, I'm going to assume you've got the instance and all of your users in the us-central1 region; other regions may have slightly different prices, and data transfer might be more expensive if some of your users are overseas.

Data transfer in on GCP is free, so you only need to worry about data transferred out to the players. Googling around, I see people reporting anywhere from 40 to 200 MB of data per hour per player. Assuming the high end of that range and 300 players for 3 hours: 200 MB * 300 players * 3 hours = 180 GB of data transfer out.

Internet egress from us-central1 is $0.12 per GB for the first 1 TB of data, so $21.60 for 180 GB, well within your free tier allowance.

The consensus on r/homelab seems to be 6-8GB of RAM should be enough.

CPU requirements are a bit less clear, but I'd be surprised if you need more than 4 CPU cores to run an unmodded server.

It looks like none of the predefined machine types have 4 vCPUs and 8 GB of memory, but you can configure a custom machine type to meet that requirement. The E2 machine series should be the right fit in terms of CPU power per dollar for a short-term, on-demand machine.

Custom E2 pricing (us-central1):
$0.021811 per vCPU hour * 4 vCPUs * 3 hours = $0.261732
$0.002923 per GB hour * 8 GB * 3 hours = $0.070152

You will also need to allocate disk storage for your instance. I doubt you'll need much space, but I/O performance scales with the size of the disk. I would go with an SSD persistent disk in that case, which is $0.170 per GB month, billed at second granularity. Assuming you allocate 20 GB of disk space and delete it immediately after the event (10800 seconds in 3 hours):

Seconds per month = 86400 * 30 = 2592000
($0.170 / Seconds per month) * 10800 seconds * 20 GB = $0.014167

Adding it all up and rounding to the nearest penny:

Data egress:               $ 21.60
E2 custom (4 vCPU, 8 GB):  $  0.33
Persistent SSD (20 GB):    $  0.01

Estimated Total (3 hours): $ 21.94
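
If you want to plug in your own numbers, here's a minimal sketch of the same arithmetic in Python. The per-player bandwidth, player count, and prices are just the assumptions from above; the unrounded total comes out a penny above the rounded line items.

# Rough cost sketch for a short-lived Minecraft server on GCP (us-central1).
# All inputs are the assumptions from the text above; adjust to taste.
HOURS = 3
PLAYERS = 300
MB_PER_PLAYER_HOUR = 200              # high end of the 40-200 MB/hour reports

EGRESS_PER_GB = 0.12                  # internet egress, first 1 TB
VCPU_HOUR = 0.021811                  # custom E2, per vCPU hour
GB_RAM_HOUR = 0.002923                # custom E2, per GB of RAM per hour
SSD_PD_GB_MONTH = 0.170               # SSD persistent disk, per GB-month

VCPUS, RAM_GB, DISK_GB = 4, 8, 20
SECONDS_PER_MONTH = 86400 * 30

egress_gb = MB_PER_PLAYER_HOUR * PLAYERS * HOURS / 1000
egress_cost = egress_gb * EGRESS_PER_GB
compute_cost = (VCPUS * VCPU_HOUR + RAM_GB * GB_RAM_HOUR) * HOURS
disk_cost = SSD_PD_GB_MONTH / SECONDS_PER_MONTH * (HOURS * 3600) * DISK_GB

print(f"Egress:  {egress_gb:.0f} GB -> ${egress_cost:.2f}")
print(f"Compute: ${compute_cost:.2f}")
print(f"Disk:    ${disk_cost:.4f}")
print(f"Total:   ${egress_cost + compute_cost + disk_cost:.2f}")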

All of that said, you will likely need at least an hour or two to get the instance configured and might want to do a few test runs first, so I'd expect to at least double that cost. You can set up the instance ahead of time and then "stop" it, which will avoid paying $0.33/hr when it's not in use, but you will still be billed for the disk while the instance is stopped.

I should also point out that if you're inexperienced with running a Minecraft server, it might be better to use a hosted service like Minecraft Realms to avoid running into issues and potentially having the server go down during your graduation ceremony. That would be sad. In my opinion, it's worth spending a few dollars on a professionally run server to avoid that.

Earliest use of seat belt signs

2020-06-21

Originally posted as a reply to What year did the seat belt sign come in to use?

The earliest reference to seat belt signs I can find is Scheduled Air Carrier Rules, 14 C.F.R. (1941)

§ 61.342 Seat belt sign. An aircraft shall not be operated in scheduled air transportation unless a suitable means for warning passengers to fasten seat belts is provided. [As added by Amdt. 129, CAR, Sept. 5, 1941, effective Oct. 1, 1941, and Amdt. 130, CAR, Sept. 12, 1941; 6 P.R. 4691, 4753]

The amendment appears in 6 FR 4753

It's unclear where the motivation for this regulation came from, but it is likely mentioned in the Civil Aeronautics Board's meeting minutes for September 5, 1941. Unfortunately it does not appear that these minutes have been digitized. The relevant records should be available in the National Archives RG 197.2.

comp.lang.ada archive

2020-06-18

Originally posted to comp.lang.ada

I'm not fond of Google Groups, so I built my own archive of the comp.lang.ada newsgroup.

https://archive.legitdata.co/comp.lang.ada/

Sources:

The earliest messages here were copied from the net.lang.ada group, which was renamed to comp.lang.ada in 1986. If you have messages from either of these groups that aren't in the archive, I'd love to include them.

Where practical, an additional Date header has been added to each message in ISO 8601 format to aid in chronological sorting. Where no timezone was given, UTC is assumed. Early messages routed via UUCP were often delayed by days as indicated by the difference between the Posted and Date-Received timestamps. In most cases, I use the value from the Posted timestamp.
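
For anyone curious what that normalization looks like, here's a rough Python sketch of the idea (not the archive's actual code): parse the existing Posted/Date value, assume UTC when no timezone is present, and emit ISO 8601.

# Sketch of the date normalization described above; not the archive's real code.
from datetime import timezone
from email.utils import parsedate_to_datetime

def iso8601(raw_date: str) -> str:
    """Turn an RFC 822-style Posted/Date value into ISO 8601, assuming UTC
    when the original header carries no timezone information."""
    dt = parsedate_to_datetime(raw_date)
    if dt.tzinfo is None:                  # no timezone given -> assume UTC
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat()

print(iso8601("Tue, 18 Mar 86 14:12:00 EST"))   # 1986-03-18T19:12:00+00:00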

A spam filter has been applied to the archive. Many thousands of advertisements for prescription drugs, sex acts, spiritual salvation, and prejudice have been removed. I do not wish to host this type of content and am actively working to train better filters and remove any spam that slipped through.

This archive is updated hourly via NNTP.

Tape vs External HDD

2020-04-22

Originally posted on Reddit

LTO tape vs external HDD

Source data from diskprices.com

Updated spreadsheet

Amazon Product Advertising API tips

2020-03-10

Originally posted in reply to Puppeteer + Node.js = Web Scraping Prices on Amazon

Yeah, this is my site. I do use the PA API to get pricing information. There are a few things to be aware of if you plan to do something similar.

If you create a new affiliate account, they won't give you an API key until you've referred at least three sales within 90 days. This needs to be done separately for each region.

Once you have an API key, the operating agreement limits what you can do with the data quite a bit, and they do check... Near as I can tell, they have some bots that flag things like outdated prices and give you a week to correct it and send an appeal. Only then does a human look at your site.

They also rate limit your requests to the API starting at 1 request per second and 8640 requests per day. They raise your limit based on 30-day trailing referral revenue, which means you have to write your code with the assumption that you might be subject to the minimum rate limit.
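
That floor works out to one request every 10 seconds on average over a day, so it's worth batching lookups and throttling on the client side. A rough sketch of that kind of throttle in Python (the GetItems wrapper and ASINs are hypothetical stand-ins, not Amazon's SDK):

# Client-side throttle sketch for the PA API's entry-level limits:
# at most 1 request per second and 8640 requests per day.
import time

class Throttle:
    def __init__(self, per_second: float = 1.0, per_day: int = 8640):
        self.min_interval = 1.0 / per_second
        self.per_day = per_day
        self.sent_today = 0
        self.window_start = time.time()
        self.last_request = 0.0

    def wait(self):
        now = time.time()
        if now - self.window_start >= 86400:       # roll the daily window
            self.window_start, self.sent_today = now, 0
        if self.sent_today >= self.per_day:
            raise RuntimeError("daily quota exhausted")
        sleep_for = self.min_interval - (now - self.last_request)
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last_request = time.time()
        self.sent_today += 1

def fetch_items(asins):
    print("would call a PA-API GetItems wrapper for", asins)   # hypothetical stand-in

throttle = Throttle()
for batch in [["B000EXAMPLE1"], ["B000EXAMPLE2"]]:             # made-up ASINs
    throttle.wait()
    fetch_items(batch)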

They have some pretty specific rules for "comparison" sites that show prices from multiple places, which I avoid by only displaying Amazon's prices.

Otherwise it's pretty straightforward. They just finished deprecating their old XML-based API yesterday and only support the 5.0 API now. It's more consistent with other modern AWS APIs, but it removed a bunch of product detail fields that the old API had. Most of those fields were rarely populated anyway.

https://webservices.amazon.com/paapi5/documentation/read-la.html

Metrics at Uber

2020-02-23

Originally posted as a comment on M3DB, a distributed timeseries database

I set up a lot of Uber's early metrics infrastructure, so I can speak to how they got to the place where building a custom solution was the right answer.

In the beginning, we didn't really have metrics, we had logs. Lots of logs. We tried to use Splunk to get some insight from those. It kinda worked and their sales team initially quoted a high-but-reasonable price for licensing. When we were ready to move forward, the price of the license doubled because they had missed the deadline for their end of quarter sales quota. So we kicked Splunk to the curb.

Having seen that the bulk of our log volume was noise and that we really only cared about a few small numbers, I looked for a metrics solution at this point, not a logs solution. I'd operated RRDtool-based systems at previous companies, and that worked okay, but I didn't love the idea of doing it again. I had seen Etsy's blog post about statsd, so I set up a statsd+carbon+graphite stack on a single server just to try it out and get feedback from the rest of the engineering team. The team very quickly took to Graphite and started instrumenting various codebases and systems to feed metrics into statsd.
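
For anyone who hasn't used statsd, "feeding metrics in" is about as simple as it gets: small fire-and-forget UDP datagrams in a tiny text format. A rough sketch in Python (the host, port, and metric names here are made up):

# Sketch of the statsd wire format: one small text datagram per event,
# sent fire-and-forget over UDP. Host, port, and metric names are made up.
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
STATSD = ("127.0.0.1", 8125)

def incr(name, value=1):
    sock.sendto(f"{name}:{value}|c".encode(), STATSD)     # counter

def timing(name, ms):
    sock.sendto(f"{name}:{ms}|ms".encode(), STATSD)       # timer

incr("dispatch.requests")
timing("dispatch.handler.latency", 42)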

statsd hit capacity problems first: it was a single-threaded Node.js process that used UDP for ingest, so once it approached 100% CPU utilization, events got dropped. We switched to statsite, which is pretty much a drop-in replacement written in C.

The next issue was disk I/O. This was not a surprise. Carbon (Graphite's storage daemon) stores each metric in a separate file in the whisper format, which is similar to RRDtool's files, but implemented in pure Python and generally a bit easier to interact with. We'd expected that a large volume of random write ops on a spinning disk would eventually be a problem. We ordered some SSDs. This worked okay for a while.

At this point, the dispatch system was instrumented to store metrics under keys with a lot of dimensions, so that we could generate per-city, per-process, per-handler charts for debugging and performance optimization. While very useful for drilling down to the cause of an issue, this led to almost exponential growth in the number of unique metrics we were ingesting. I set up carbon-relay to shard the storage across a few servers (I think there were three, but it was a long time ago). We never really got carbon-relay working well. It didn't handle backend outages and network interruptions gracefully, and it would sometimes start leaking memory and crash, seemingly without reason. It limped along for a while, but it wasn't going to be a long-term solution.
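
The idea behind that kind of sharding is to hash the metric name so that every datapoint for a given key always lands on the same backend. A toy illustration (carbon-relay itself uses a consistent hash ring; this is not its code):

# Toy illustration of sharding metrics by key: hash the metric name so every
# datapoint for a key lands on the same backend. Hosts and the metric name
# below are made up.
import hashlib

BACKENDS = ["carbon-a:2003", "carbon-b:2003", "carbon-c:2003"]

def backend_for(metric_name: str) -> str:
    digest = hashlib.md5(metric_name.encode()).hexdigest()
    return BACKENDS[int(digest, 16) % len(BACKENDS)]

print(backend_for("stats.dispatch.sf.requests"))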

We started looking for alternatives to carbon, as we wanted to get away from whisper files... SSDs were still fairly expensive, and we believed that we should be able to store an append-only dataset on spinning disks and do batched sequential writes. The infrastructure team was still fairly small, and we didn't have the resources to properly maintain an HBase cluster for OpenTSDB or a Cassandra cluster, which would've required adapting carbon (I understand that Cassandra is a supported backend these days, but it was just an idea on a mailing list at that point).

InfluxDB looked like exactly what we wanted, but it was still in a very early state, as the company had just been formed weeks earlier. I submitted some bug reports but was eventually told by one of the maintainers that it wasn't ready yet and I should quit bugging them so they could get to MVP.

Right around this time, we started having serious availability issues with metrics, both on the storage side (I estimated we were dropping about 60% of incoming statsd events) and on the query side (Graphite would take seconds to minutes to render some charts and would occasionally just time out). We had also built an ad-hoc system for generating Nagios checks that polled Graphite every minute to trigger threshold-based alerts, which would make noise whenever Graphite was down even though the monitored system was fine. This led to on-call fatigue, which made everybody unhappy.

We started running an instance of statsite on every server to aggregate that server's individual events into 10-second buckets, with the server's hostname as a key prefix, and push them to carbon-relay. This solved the dropped-packets issue, but carbon-relay was still unreliable.
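
Conceptually, that per-host layer collapses a flood of individual events into one hostname-prefixed datapoint per metric per 10-second window, which then goes out in carbon's plaintext line format. A sketch of the keying idea (the servers.<hostname>. prefix is illustrative, not our exact naming scheme):

# Sketch of per-host aggregation: collapse individual events into one
# hostname-prefixed datapoint per metric per 10-second bucket, then emit
# carbon's plaintext "<key> <value> <timestamp>" lines. This illustrates the
# keying scheme only; it is not statsite's implementation.
import socket
from collections import defaultdict

HOSTNAME = socket.gethostname().replace(".", "_")
FLUSH_INTERVAL = 10                      # seconds

buckets = defaultdict(float)

def record(metric: str, value: float = 1.0):
    buckets[f"servers.{HOSTNAME}.{metric}"] += value

def flush(timestamp: int):
    lines = [f"{key} {value} {timestamp}" for key, value in buckets.items()]
    buckets.clear()
    return lines                         # in reality, sent to carbon-relay over TCP

record("dispatch.requests")
record("dispatch.requests")
print(flush(1388534400))                 # ['servers.<host>.dispatch.requests 2.0 1388534400']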

We were pretty entrenched in the statsd+graphite way of doing things at this point, so switching to OpenTSDB wasn't really an option and we'd exhausted all of the existing carbon alternatives, so we started thinking about modifying carbon to use another datastore. The scope of this project was large enough that it wasn't going to get built in a matter of days or weeks, so we needed a stopgap solution to buy time and keep the metrics flowing while we engineered a long term solution.

I hacked together statsrelay, which is basically a re-implementation of carbon-relay in C using libev. At that point I was burned out, so I handed off the metrics infrastructure to a few teammates who ran with statsrelay and turned it into a production-quality piece of code. Right around the same time, we'd begun hiring for an engineering team in NYC that would take over responsibility for metrics infrastructure. These are the people who eventually designed and built M3DB.

Rye bread recipe

2020-02-18

Originally posted on Reddit

A beautiful loaf of rye bread

I used the recipe from A World of Breads. The original recipe was in cups and only had 50% hydration, which didn't look right. I also added caraway seeds, because you can't make rye without caraway!

750g KAF All Purpose
250g Arrowhead Mills Rye
100g Molasses
20g  Salt
10g  Caraway Seeds
8g   Instant Yeast
600g Water
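
A quick sanity check on the hydration of the version above:

# Baker's-percentage check for the adjusted recipe.
flour = 750 + 250        # all-purpose + rye, grams
water = 600              # grams
print(f"hydration: {water / flour:.0%}")   # 60%, up from the original 50%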

Split into 6 small loaves and baked at 500F for 30 minutes. I'll probably do it at 450F next time, as they came out just a little overdone.

3D printed Eurorack parts

2020-01-29

Originally posted in reply to Who’s got a 3D printer and wants a cute little eurorack? on Reddit

I've been experimenting with 3D printing parts to mount eurorack modules as well. My printer has a relatively small working area, so that constraint has driven a lot of my design decisions. Currently, I'm printing adapters that allow me to mount eurorack modules on 2020 aluminum extrusions, which are pretty cheap.

3d printed eurorack multiplexer on a cluttered desk

Responses to Disk Prices comments

2020-01-27

Originally posted in reply to Disk Prices on Amazon on Hacker News.

Hi, this is my site. Thank you for all of the feedback! I can address a few of the issues you've raised...

Missing some products. This is a known bug in how I'm currently importing product data from Amazon's API. Some products are listed with variations that appear on the same product page but are completely separate products in Amazon's catalog. There's no way to get the API to return all of the variations at once, so I have to perform several subrequests to enumerate those. Currently, that doesn't happen. However, I'm in the process of reworking a lot of the data import code to use the new PA-API 5.0 and am planning to make variations work properly with those changes.
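
For the curious, the fix mostly amounts to paging through the variation list with repeated GetVariations calls. A sketch of that loop in Python, with a stubbed-out client standing in for a real signed PA-API 5.0 request:

# Sketch of enumerating a product's variations via repeated subrequests.
# get_variations is a hypothetical stand-in for a signed PA-API 5.0
# GetVariations call, which returns variations a page at a time.
def get_variations(asin: str, page: int) -> list:
    # Stub for illustration: pretend this ASIN has 23 child variations,
    # returned 10 per page (roughly how the real API pages results).
    children = [f"{asin}-child-{i}" for i in range(23)]
    return children[(page - 1) * 10 : page * 10]

def all_variations(asin: str) -> list:
    items, page = [], 1
    while True:
        batch = get_variations(asin, page=page)
        if not batch:
            break
        items.extend(batch)
        page += 1
    return items

print(len(all_variations("B00EXAMPLE")))   # 23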

Filter by products sold by Amazon. Initially, diskprices.com was set to filter out products not sold by Amazon, but I received quite a bit of feedback asking me to remove that filter as some of the best deals are from resellers. The new PA-API does have a populated Merchant field for most products, so I may try to expose that.

Display prices including shipping costs. This is addressed in the FAQ, but it really comes down to privacy. I'd need you to log in with your Amazon account or give me your location in order to compute tax and shipping, and I really don't want the burden of handling PII.

Clicking the Back button does weird things. This is a bug. Whenever you change a filter, a bit of JavaScript updates the URL in your address bar so that you can copy/paste a link to the page you're looking at with the current filters and send it to someone. It appears I'm not capturing the Back event and updating the filters to match the address bar. I'll get this fixed soon.

External SSD category is missing. Up until the last few months, there really weren't many external SSDs for sale on Amazon, but it looks like this is definitely becoming its own product category, so I'll add it.

Add Amazon.co.jp. Funny story: diskprices.com had support for Amazon Japan when I first launched it, but they suspended my account and sent an email telling me why, written in Japanese. Google Translate couldn't make any sense of the email, and I'd had some significant data quality issues with the filters, so I decided not to pursue it further. This is the first time somebody's asked for Amazon.co.jp support. I'll look into setting it up again, but Amazon has added some new restrictions on API access across all regions since then, so it's a bit more difficult to get new regions added now.

Account for per-port cost in calculating prices. This is something I've been thinking about for a while. I think this feature ends up looking a lot like pcpartpicker with a constraint solver bolted on the side: given a set of parameters for total capacity, redundancy, bandwidth, etc., optimize for the best price/performance. I currently don't have enough metadata about most of the drives to implement this properly, and it's a big feature to develop, but it's something I want to experiment with eventually.
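
To give a flavor of what I mean, here's a toy version of the calculation with completely made-up drive data and an assumed fixed cost per occupied port; the real feature would need much richer metadata and real constraints.

# Toy sketch of "price per TB including per-port cost": for each drive model,
# how many are needed to hit a capacity target, and what does the total come
# to once each occupied port/bay is priced in? All numbers here are made up.
from math import ceil

PORT_COST = 25.00                     # assumed cost per port/bay
TARGET_TB = 40

drives = [
    {"name": "8 TB drive",  "tb": 8,  "price": 140.00},
    {"name": "14 TB drive", "tb": 14, "price": 220.00},
    {"name": "18 TB drive", "tb": 18, "price": 310.00},
]

for d in drives:
    count = ceil(TARGET_TB / d["tb"])
    cost = count * (d["price"] + PORT_COST)
    print(f"{count} x {d['name']}: ${cost:.2f} (${cost / TARGET_TB:.2f}/TB)")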

Again, thank you all for the feedback!