Survival Analysis: Backblaze Hard Drives

Backblaze, a cloud backup service, provides one of the best public services on the internet by periodically posting hard drive failure rates for the drives in their datacenter. Because they use large numbers of consumer hard drives, which are the same ones I would consider buying to use in my desktop computer, I like to consult their blog whenever I am shopping for a new one.

The key metric presented in Backblaze's blog posts, the annualized drive failure rate, is a reasonably good starting point for understanding which hard drives are more reliable than others. However, not all of Backblaze's drives are the same age, an individual drive's chance of failure may vary over its lifetime in a way that is not captured by a simple summary statistic. For example, a failure rate may hold steady at a reasonable value while the drive is under warranty, only to fail at a higher rate after it hits a certain age.

Capturing patterns like this is the domain of survival analysis, and in particular a visualization called the "survival curve". This is a plot of the fraction of a population that hasn't had some terminal event happen to it yet ("death" or "failure") as a function of time elapsed since some starting point. This is a commonly used technique in studies of medical patient survival rates after receiving some treatment, but equipment failure is another good application.

Backblaze commendably makes fully granular source data available in addition to the summary statistics in their blog posts. Ross Lazarus, an Australian computational biologist, used this dataset to build survival curves for hard drives in a series of blog posts in 2016. Reading this series was the first I had heard of a survival curve, and it seems like such a great visualization for understanding this data that I'm surprised Backblaze doesn't report it themselves.

In this post I'll repeat Ross Lazarus' analysis using data that has been updated through Q3 2019. I also refine his technique somewhat by using Apache Spark to improve performance and expressiveness of the aggregation.

Designing the aggregate

To construct the survival curve, we will use a Kaplan-Meier estimator. This estimator builds survival curves from data in a way that handles missing data in a useful way. This is important because not every hard drive will be in the datacenter long enough to fail. For example, if a hard drive is retired from the data center after only one year to make space for one with a larger capacity, then this drive didn't fail – instead, it is "right-censored" in the data. This means we don't know exactly how long it would have taken for it to fail, but we know that it worked without failing for at least one year. Kaplan-Meier regression enables us to use this partial information to build the survival curve instead of throwing it away.

Backblaze provides a dataset that has one row per drive, per day that it operated, with a full snapshot of the drive's self-reported SMART stats on that day. For our purposes, we need only one row per drive with two pieces of information: how long it operated for and whether it failed or not. In theory, this is a straightforward aggregate to run, but because the raw dataset Backblaze provides is many gigabytes in size, it can't be run in Pandas because it can't be fit into memory. A person who wears suits to work might refer to this as a "big data" problem.

Ross Lazarus solved this by exploiting the fact that the dataset is ordered and partitioned by day. His Python program loads one day at a time and keeps track of the first and last day it observes each serial number. This is a clever approach that works pretty well, but specifying it in this way (the "imperative" style of programming) isn't as expressive as it could be using a declarative language like SQL. Using SQL, I could express our desired aggregate as

    max(date) as retired_date,
    min(date) as launched_date,
    count(date) as observed_days,
    max(failure) as failure
from source_table
group by manufacturer, model, serial_number

Which I find to be a more concise description of what's going on. Unfortunately, to actually run this query, I would typically need to first load the data into a database server such as Postgres. Upon load, the database program would rewrite all that data into its internal data format, effectively making a copy of the large input dataset. Even after that cumbersome operation is done, running a SQL query might not be very fast unless I am using an analytics-optimized database implementation such as Redshift.

Apache Spark is a great goldilocks option here. Spark is an analytics framework that provides a SQL interface as well as a dataframe interface for specifying data transformations. Crucially, it can perform those transformations on data in a wide variety of formats, such as a folder full of CSVs (which is what I have).

Strictly speaking, Spark is designed to work best by distributing the work across multiple computers in a supercomputing cluster, but it also does a perfectly serviceable job distributing the work among the four processing cores in my one lonely computer. The ability to use all the cores in my processor is the key element that makes this approach faster than traversing the same data in a single-threaded Python program.

import os

import pandas as pd
from lifelines import KaplanMeierFitter
from humanize import naturalsize
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pyspark
import pyspark.sql.functions as F

Loading data into Spark

To prepare for this analysis, I downloaded all the datasets from Backblaze's site using Safari, which automatically unzips them. I placed all the resulting folders into a "data" parent folder. Then I renamed each of them with the dataset= prefix, which is a trick that causes Spark to treat them as partitions of the same dataset when I point it at the parent folder. This can be any prefix that ends with an equals sign. I kept Backblaze's original folder names, which haven't been 100% consistent, but that doesn't matter for us.


The uncompressed data comprise 40GB, which is much more than the amount of RAM I have. It probably would take up much less space if it were in a more efficient format such as Parquet, but for this exercise I'm trying to do my analysis on the raw files in-place.

!du -sh data
 40G	data

Load the data into a Spark dataframe:

df ='data', header=True)

Unlike Pandas, this line of code doesn't put all the data into memory. Instead, Spark just peeks at the file layout and the file headers to construct a dataframe object using only the metadata. The object in memory only makes reference to the files on disk; it hasn't looked at the contents yet.

Operations on the actual data will be evaluated "lazily", meaning Spark will put them on a to-do list and defer them until we actually need to know something definite about the result (such as counting the number of rows it has, or loading it into Pandas). When it comes time to actually do the work, Spark delegates the job of actually looking in the CSV files to a set of executor processes.

Data cleaning

I select only the columns that we will be using, trim whitespace, and cast them to the correct data types:

dff =

Lazy evaluation means that none of this work actually happened yet, so if there are errors I won't know until later. The limit method is a good way to test that things will work using only the first few rows of data.

dfm = dff.withColumn(
    F.when(F.col("model").like("ST%"), "Seagate")
    .when(F.col("model").like("Hitachi %"), "HGST/Hitachi")
    .when(F.col("model").like("HGST %"), "HGST/Hitachi")
    .when(F.col("model").like("% %"), F.split(F.col("model"), " ")[0])

The logic above adds a column to the dataframe using logic equivalent to a SQL CASE statement, which is a neat thing that is easier to do in Spark than in Pandas.

Data aggregation

To write my aggregate using SQL, I first need to make the dataframe available in the Spark SQL context.


Now it's time to write the aggregation query. There are some quirks in the data that I need to code around.

  • According to the documentation, the failure flag is supposed to be set to 1 on the last day that the drive was operating. But I found examples where the drive is marked as failed but continues to operate for many days later, sometimes with the flag set back to zero. Maybe this means the drive didn't really fail, but to be safe I treat a drive as failed if it ever had this flag set on any day (max(failure) as failure). I also record the date that the failure flag first appeared (min(case when failure=1 then date end) as failed_date) instead of relying on that to be the same as max(date).
  • In theory, all drives of the same model should have the same capacity. But the capacity_bytes of some drives is sometimes missing for a day (marked as -1 in the data), but available for other days. This means we can't safely group by this column. I select max(capacity_bytes) as capacity_bytes instead.
  • The number of hours a drive has been operating (smart_9_raw) could be useful in this analysis, but is self-reported by the drive so may have errors that I haven't cleaned up. I include it in this aggregate even though I don't plan to use it for this.

The query I wrote is similar to the one I quoted earlier in the post:

drive_spans = spark.sql("""
    max(date) as retired_date,
    min(date) as launched_date,
    count(date) as observed_days,
    max(capacity_bytes) as capacity_bytes,
    min(case when failure=1 then date end) as failed_date,
    max(smart_9_raw) as max_hours,
    min(case when failure=1 then smart_9_raw end) as failed_hours,
    max(failure) as failure
from drive_days
group by manufacturer, model, serial_number

This code defines our big, slow aggregate in terms of the SQL query, but doesn't actually perform any calculations until we perform some action on the resulting dataframe (because of lazy evaluation).

If we expect to do more than one thing with the drive_spans dataframe, it's a good idea to use the .cache() suffix, which effectively establishes a checkpoint in the chain of lazy evaluation. This tells Spark to save the result in memory or in a temp file so that repeated actions on the dataframe don't result in Spark having to start all over from the un-aggregated data each time.

I run a count() to actually do the aggregate:

CPU times: user 30.7 ms, sys: 14.3 ms, total: 45.1 ms
Wall time: 5min 11s

This took ~5 minutes on my quad-core Ivy Bridge desktop. With irrelevant details aggregated out, it's now small enough to load into Pandas.

Survival analysis

To do my survival analysis, I need to add a few more columns to the aggregate.

  • The number of days the drive lasted: difference between the max and min dates that the drive was seen in the data center. I use coalesce to use the failure date when it exists, with the max observed date as a fallback, to handle cases when the drive failed several days before being removed.
  • Some drive model-level metadata that I'll use for context / ordering. I use a window function to calculate:
    • the date that any drive of this model was first introduced to the datacenter
    • storage capacity of the model. I need to do this because of glitches at the indivdual-drive level. Because the capacity of individual drives can be anomalously high and low, I use the median as the aggregate function rather than max or min.

Window functions are a handy way to attach per-model summary stats to this dataset while keeping it at the serial-number granularity.

dfsurv = spark.sql("""
    datediff(coalesce(failed_date, retired_date), launched_date) as duration,
    min(launched_date) over (partition by model) as model_introduced,
    percentile_approx(capacity_bytes, 0.5) over (partition by model order by capacity_bytes) as model_capacity
from drive_spans

Now select the columns we need for the analysis and load them into Pandas.


pdsurv = (
        launched_date=lambda x: pd.to_datetime(x.launched_date),
        model_introduced=lambda x: pd.to_datetime(x.model_introduced),
CPU times: user 1.89 s, sys: 92.1 ms, total: 1.98 s
Wall time: 4.86 s
manufacturer model model_introduced model_capacity launched_date duration failure
S2X6BE9Z Seagate ST9250315AS 2014-01-29 250059350016 2014-02-14 1581 0
S2X6BYB6 Seagate ST9250315AS 2014-01-29 250059350016 2014-02-14 1600 0
S2X3XJ5Y Seagate ST9250315AS 2014-01-29 250059350016 2014-02-14 472 1
S2X6AD6A Seagate ST9250315AS 2014-01-29 250059350016 2014-02-14 1576 0
S2X6CLBK Seagate ST9250315AS 2014-01-29 250059350016 2014-02-14 1454 0

Now it's time to fit and plot the KM survival curve. I'm using the lifelines Python module, but you could also switch to R here and use the survival or survminer packages.

YEARS = 365.25
kmf = KaplanMeierFitter() / YEARS, event_observed=pdsurv.failure)
ax = kmf.plot()

ax.set_ylabel("survival %")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1, decimals=0))

The survival curve shows the percentage of drives surviving after a given period of use. We can read the failure rate off as the slope of the curve: about 2% per year. The curve is more or less linear, indicating that the drives Backblaze is buying fail at a more or less constant rate overall.

We do see a steep drop, and a broadening of the confidence interval, at the five-year mark. But few drives have been in the datacenter that long (and the full dataset barely spans a 6 year period) so I wouldn't read too much into it.

It is more interesting to break this down by manufacturer:

ax = plt.gca()

for name, grouped_df in pdsurv.groupby("manufacturer"):

    if len(grouped_df) < 100:
        # skip small sample manufacturers

    kmf = KaplanMeierFitter() / YEARS, grouped_df.failure, label=name)

ax.set_ylabel("survival %")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1, decimals=0))

A real disparity is evident here between the expensive HGST drives and the consumer grade drives such as Seagate, Western Digital and Toshiba. The enterprise drives have less than 5% failure rate even after 4 years, whereas the consumer drives all exceed that rate before their second year of use.

The curve for Toshiba is very interesting because it grows steeper and then levels off. This may indicate that these drives had a problem that caused some of them to wear out at a faster rate after a year and a half of use. The fact that the curve levels back out could indicate that reliability improves after the bad apples fail their way out of the population.

Making inferences at the drive brand level is not really that actionable as a consumer since a lot of these behaviors are specific to the individual model of drive. Let's look at individual models of Seagate drives.

ax = plt.gca()

mask = pdsurv.manufacturer == "Seagate"
gcols = ["model_introduced", "model_capacity", "model"]

for group, grouped_df in pdsurv[mask].groupby(gcols):

    model_introduced, model_capacity, model = group

    if len(grouped_df) < 500:
        # skip small sample groups

    label = "{} ({}, {})".format(
        model, naturalsize(model_capacity), model_introduced.year

    kmf = KaplanMeierFitter() / YEARS, grouped_df.failure, label=label)

ax.set_ylabel("survival %")
ax.legend(loc="center left", bbox_to_anchor=(1, 0.5))
ax.yaxis.set_major_formatter(mtick.PercentFormatter(1, decimals=0))

There are a couple things to learn here. First of all, 2013 was a dark time for hard drives – prices were high and quality was low. The 3TB Seagates were so bad that Backblaze published a special episode of their hard drive series devoted entirely to discovering what was going on. I myself had one of these drives fail in my computer, which is part of what radicalized me into running statistical analyses on hard drive data.

Viewing the survival plot at the full scale makes it hard to see the drives that you would actually buy in 2019, since those have shorter histories. Zoom in to the upper left corner to get a look at how the new drives are doing:

ax.set_ylim(top=1, bottom=.9)