Django is Dangerous
Django is a web framework that bundles everything a backend engineer might think a website needs: an ORM, a templating engine, a content management system, an admin console, and a collection of utilities. It was invented to power a small-town newspaper's website[0], for which it may have been well-suited. But it has two massive flaws that cripple many of the websites that use it today.
No Data Integrity
Data integrity problems are particularly insidious, because they tend to lie dormant until a site starts to get a lot of traffic and accumulate a lot of data, and because bad data, unlike bad code, often cannot be fixed. Django provides plenty of data integrity footguns for the careless.
ATOMIC_REQUESTS = False by Default
Among many other things, Django provides an ORM: instead of using SQL directly, you create "models", which are Python classes with "fields" that represent various pieces of data, and the framework handles their serialization and deserialization. In doing so, it also tries to hide the finer points of database interaction from you.
Among these points is transaction management, where you decide which series of database queries need to take place as atomic transactions[1][2].
Since the framework has no insight into the conceptual relationships between queries, it has essentially two choices: the most fine-grained strategy (only individual queries are atomic) or the most coarse-grained strategy (every HTTP request is processed atomically).
These choices offer the classic trade-off between performance and safety, which everyone agrees should default towards safety.
Django allows the user to control this behavior via the ATOMIC_REQUESTS setting, but it defaults to False, the unsafe but more performant choice.
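Flipping the default back toward safety is a one-line, per-database setting. A sketch of a settings.py fragment (the engine and database name are placeholders):

```python
# settings.py (sketch; connection details are placeholders)
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "mydb",
        # Wrap every view in a transaction: commit on success,
        # roll back if the view raises.
        "ATOMIC_REQUESTS": True,
    }
}
```

Views that genuinely cannot afford a transaction can then opt out individually with the `transaction.non_atomic_requests` decorator, which is the right polarity: safe by default, fast by exception.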
Common Django usage patterns exacerbate this terrible design decision. Most Django views[3] look like the following:
def my_view(request):
    instance = GrabBagOfData.objects.get(...)   # retrieve and deserialize a model instance from the database
    instance.some_godawful_expensive_method()   # modify the instance object
    instance.save()                             # serialize and store the modified instance
    return HttpResponse(...)
Unless explicitly told otherwise, the save method writes every field on the model back to the database, including those which have not been modified, silently overwriting any changes that have been made since the model instance was retrieved.
Since GrabBagOfData probably serves numerous distinct purposes, all sorts of AJAX requests are firing off concurrently to modify different fields, trampling each other's updates. Hell, some_godawful_expensive_method is often so slow that the user will manually trigger new requests before it has finished.
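The lost update is easy to reproduce outside Django. This stdlib-only sketch simulates two interleaved requests that each read a whole row and then write the whole row back, the way an unqualified save() does (the table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE grabbag (id INTEGER PRIMARY KEY, a TEXT, b TEXT)")
con.execute("INSERT INTO grabbag VALUES (1, 'old_a', 'old_b')")

# Two "concurrent" requests each read the full row...
row1 = con.execute("SELECT a, b FROM grabbag WHERE id = 1").fetchone()
row2 = con.execute("SELECT a, b FROM grabbag WHERE id = 1").fetchone()

# Request 1 modifies a, but writes back every field, like save()
con.execute("UPDATE grabbag SET a = ?, b = ? WHERE id = 1", ("new_a", row1[1]))
# Request 2 modifies b, also writes back every field, clobbering request 1
con.execute("UPDATE grabbag SET a = ?, b = ? WHERE id = 1", (row2[0], "new_b"))

print(con.execute("SELECT a, b FROM grabbag WHERE id = 1").fetchone())
# → ('old_a', 'new_b'): request 1's write to a is silently lost
```

Django's partial remedy, save(update_fields=["b"]), writes only the named fields; it just isn't the default, so nobody uses it until after the data is gone.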
get_or_create and update_or_create are non-atomic
These methods get or update a model instance specified by certain parameters if it exists, and create such an instance if it does not.
What is the point of using these methods instead of, say, trying a get and falling back to a create?
One would probably expect them to prevent data races: if an instance is created after the get fails but before the create is issued, these methods should prevent a duplicate instance from being created.
But the Django documentation gently notes otherwise, in an unhighlighted paragraph of text more than a page into the notes for get_or_create:

This method is atomic assuming correct usage, correct database configuration, and correct behavior of the underlying database. However, if uniqueness is not enforced at the database level for the kwargs used in a get_or_create call (see unique or unique_together), this method is prone to a race-condition which can result in multiple rows with the same parameters being inserted simultaneously.
In fact, the implementation makes no attempt to prevent data races, relying entirely on the database[4]. To add insult to injury, the documentation recommends lowering the MySQL isolation level[5], thereby making your entire system less safe.
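The race is mechanical, not exotic. This stdlib sqlite3 sketch interleaves two get-or-create attempts by hand: both SELECTs run before either INSERT, and without a unique constraint the database happily stores a duplicate (the widget table is invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# No UNIQUE constraint, as in a model field without unique=True
con.execute("CREATE TABLE widget (name TEXT)")

# Two "concurrent" requests each run the get step...
r1 = con.execute("SELECT rowid FROM widget WHERE name = 'spam'").fetchone()
r2 = con.execute("SELECT rowid FROM widget WHERE name = 'spam'").fetchone()

# ...and, both having found nothing, each runs the create step.
if r1 is None:
    con.execute("INSERT INTO widget (name) VALUES ('spam')")
if r2 is None:
    con.execute("INSERT INTO widget (name) VALUES ('spam')")

count = con.execute("SELECT COUNT(*) FROM widget WHERE name = 'spam'").fetchone()[0]
print(count)  # → 2: get-or-create semantics produced a duplicate
```

With a UNIQUE constraint on name, the second INSERT would raise sqlite3.IntegrityError instead, which is exactly the database-level enforcement Django's implementation silently depends on[4].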
Validation defined on the model is only enforced on the form
Django models can be associated with one or more "forms", which are Python representations of HTML forms used to create and update model instances.
Forms have a lot of default behavior, particularly validation, which is controlled by properties of the model.
In fact, many properties of the model exist only to control the forms' default behavior.
Because nobody ever modifies a model instance except through its form, right?
And good luck keeping track of which constraints are where. Non-null? That's on the model. Define choices on a field, explicitly enumerating what values it can have? That's on the form. Uniqueness? On the model. Decimal field? The model will take any string!
This inconsistent validation also results in a classic terrible user experience: forms, pre-populated with the existing data for an object, that cannot be submitted because the existing data is invalid.
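One partial defense is to force model-level validation on every write by overriding save() to call full_clean(), which does run field-type, choices, and uniqueness checks. This is a common workaround, not Django's default behavior; the model and field below are invented for illustration:

```python
from django.db import models

class Vehicle(models.Model):
    wheel_count = models.IntegerField()

    def save(self, *args, **kwargs):
        # full_clean() runs the field, choices, and uniqueness validation
        # that otherwise only happens on the form.
        self.full_clean()
        super().save(*args, **kwargs)
```

It costs an extra query for uniqueness checks, and it will break any code path that relies on saving invalid values, which may be precisely the point.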
Invalid values are silently coerced to None
Not consistently of course. I can't tell you how many times I've run something like the following:
>>> Model.objects.filter(field__isnull=True)  # retrieve every model instance where field is None
[]                                            # looks like there are no such instances
>>> Model.objects.get(id=12345).field is None # check whether a particular model instance has field set to None
True                                          # it does
What has happened here is that the serialized model instance has an invalid value, but the value is not NULL, so the first query finds nothing. The second query deserializes the instance and, upon encountering the invalid value, silently treats it as if it were NULL, instead of raising an error like any sane function would.
This tragedy is particularly common with datetimes, since poorly-behaved applications tend to ignore all the nastiness involved in handling datetimes correctly[6][7].
Duplicate fields abound
Since queries can't use computed columns and adding a real column is only a makemigrations away, models steadily accrue duplicate fields that represent the same data in different ways.
Since Django doesn't support complex constraints, these fields inevitably drift out of sync.
Pretty soon half of your Vehicles have is_motorcycle == True and wheel_count == 4, and you're not sure which field to trust (hint: neither).
One of the great things about Python is that you can refactor inconsistent properties like this with the @property decorator. But while the ORM allows you to access columns as properties, the reverse is not true, so you have to manually refactor every query.
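In plain Python the derived field costs one decorator. A minimal sketch, with the class standing in for the model:

```python
class Vehicle:
    def __init__(self, wheel_count):
        self.wheel_count = wheel_count

    @property
    def is_motorcycle(self):
        # Derived from wheel_count, so the two can never disagree.
        return self.wheel_count == 2

print(Vehicle(wheel_count=2).is_motorcycle)  # → True
print(Vehicle(wheel_count=4).is_motorcycle)  # → False
```

The catch described above: the ORM cannot filter on a Python property, so every query that used the stored is_motorcycle column has to be rewritten against wheel_count by hand.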
Dynamic Templating
Dynamic templating gives your views an enormous surface area and makes them effectively untestable[8].
But suppose that you come up with a hack to kinda-sorta test a page. Django's templating engine offers two different kinds of template inheritance, the extends and include tags. This alone means that a single template may have content spread over many files, but it gets worse.
While a template can only extend one other template, it can include arbitrarily many, and can even include templates dynamically, making it nearly impossible to tell what the template looks like after resolving inheritance, or even how many distinct templates you have.
I've worked on systems where nobody knew the number of templates to within an order of magnitude, and nobody would have been surprised to learn it was in the millions.
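Dynamic inclusion looks innocuous in isolation. Both lines below are legal Django template syntax; the second resolves the template name from a context variable at render time, so no static analysis can tell you which file gets pulled in (the template names are invented for illustration):

```html
{% include "sidebar.html" %}        {# static: the dependency is at least greppable #}
{% include widget_template_name %}  {# dynamic: resolved per-request from the context #}
```

One dynamic include anywhere in the tree is enough to make the full set of reachable templates a runtime question.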
Forms, our friends from the previous section, are also the kings of dynamic templating, large, critical chunks of entirely dynamically generated HTML, and as such they bring their own pains. Change the name of a field? Since <input> names and ids are dynamically generated, now your JavaScript and your CSS are broken! A designer wants to make an adjustment? Chances are they'll have to touch the Python code too. What could go wrong?
Performance naturally suffers too. In addition to the obvious inefficiency of building every page per-request, you have to compress them per-request as well! Or you would, if you didn't have to abandon compression completely to prevent BREACH attacks, and as a result send several times as much data with each response.
Conclusion
These are far from all the problems with Django[9][10], but what makes them the most insidious is that they grow with scale. Most of these are minor issues when you are small and your project is simple. And Django is useful, which is why so many projects pick it up early on. Then as your project grows, these problems get worse and affect more users.
Can Django be fixed? Most of the data integrity issues are implementation mistakes that could be fixed with relatively little work and without introducing too much backwards incompatibility.
Performance would certainly suffer, particularly for get_or_create and update_or_create, but performance has never been a priority for Django.
Moving validation to the model would be the primary source of backwards incompatibility, but it seems likely that any system which relies on storing invalid values is already broken.
It would be difficult to offer a good way around creating duplicate fields, but this at least is a problem with most frameworks.
Dynamic templating, on the other hand, was a fundamental design mistake, and migrating away from it would be almost as much work as switching frameworks altogether, so it will probably never leave Django.
In the best possible outcome, where Django fixes most of its data integrity issues and various other warts, I still would recommend against using it.
[0] The Lawrence Journal-World, served HTTP-only and from the "www2" subdomain.

[1] Atomicity is a central concept in concurrent computing, and a full introduction would not fit in a footnote. In brief, an atomic transaction is a collection of database queries which cannot be interrupted by another query, and which from the perspective of another query all appear to execute at the same time.

[2] Making this choice for the developer requires turning autocommit on, in direct contradiction of PEP 249.

[3] Views are the functions that process incoming requests and return responses.

[4] The implementation assumes that creating a duplicate will raise an IntegrityError, which is only the case if duplicates violate a database constraint.

[5] From the documentation: "If you are using MySQL, be sure to use the READ COMMITTED isolation level rather than REPEATABLE READ (the default), otherwise you may see cases where get_or_create will raise an IntegrityError but the object won't appear in a subsequent get() call."

[6] Leap Day, Daylight Savings Time, and months indexed from 0 (thanks, JavaScript) are all common causes.

[7] A surprisingly common response is "don't have poorly behaved applications modifying your database". I wonder what kind of utopia these people live in, where they control, or even know about, all the applications modifying their database.

[8] I would like to thank Tim Best for pointing this out.

[9] Like the queryset methods first() and earliest(field), where earliest(field) is like order_by(field).first() except when the queryset is empty, in which case first() returns None while earliest(field) raises an ObjectDoesNotExist exception.

[10] Or the FileField and FieldFile API, where the documentation claims that FieldFile behaves like Python's file object, for instance documenting FieldFile.open() as: "Behaves like the standard Python open() method and opens the file associated with this instance in the mode specified by mode." However, instead of returning an open file (which is helpfully a context manager) like the builtin open(), FieldFile.open() returns None, so none of the same patterns apply.