I have been working with Airflow for more than three years now and, overall, I am quite confident with it. It’s a powerful orchestrator that helps me build data pipelines quickly and in a scalable fashion, and for most things I want to implement it comes with batteries included.
Recently, while preparing for an Airflow certification, I came across many things I had literally no clue about. That was essentially my motivation to write this article and share with you a few Airflow internals that have totally blown my mind!
1. Scheduler only parses files containing certain keywords
The Airflow Scheduler will parse only files containing both airflow and dag in the code! Yes, you’ve heard this right! If a file under the DAG folder does not contain both of these keywords, it will simply not be parsed by the scheduler.
If you want to change this rule so that it is no longer a requirement for the scheduler, you can simply set the DAG_DISCOVERY_SAFE_MODE configuration setting to False. In that case, the scheduler will parse all files under your DAG folder. I wouldn’t recommend disabling this check though, since doing so doesn’t really make any sense: a proper DAG file will have Airflow imports and a DAG definition, which means the requirements for parsing that file are already met. Still, it is worth knowing that this rule exists.
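To make the rule concrete, here is a minimal sketch of what such a safe-mode check could look like. The function name `might_contain_dag` and the case-insensitive matching are assumptions made for illustration; this is a simplified stand-in, not the scheduler's actual code.

```python
from pathlib import Path


def might_contain_dag(path, safe_mode=True):
    """Simplified stand-in for the safe-mode check (illustrative only)."""
    if not safe_mode:
        # With safe mode disabled, every file is handed to the parser.
        return True
    # Assumption: the match is done on lowercased file content.
    content = Path(path).read_bytes().lower()
    # The file qualifies only if both keywords appear somewhere in it.
    return all(keyword in content for keyword in (b"dag", b"airflow"))
```

Under this sketch, a helper module that never mentions Airflow is skipped, while a file that starts with `from airflow import DAG` is picked up. The real setting lives in the `[core]` section of `airflow.cfg` as `dag_discovery_safe_mode`, and can also be set through the `AIRFLOW__CORE__DAG_DISCOVERY_SAFE_MODE` environment variable.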
2. Variables with certain keywords in their name have their values hidden
We know that by default, Airflow will hide sensitive information stored in a Connection (and more specifically in the password field), but what about Variables?
Well, this is indeed possible, and the mind-blowing thing is that Airflow can do this automatically for you. If a Variable’s name contains certain keywords that may indicate sensitive information, its value will automatically be hidden.
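The idea can be sketched in a few lines of Python. Everything here is hypothetical: the helper names, the `***` placeholder, and the three sample keywords are assumptions chosen for illustration, and they are not Airflow's real keyword list or implementation.

```python
# Hypothetical sample keywords, NOT Airflow's actual list.
ILLUSTRATIVE_KEYWORDS = ("password", "secret", "api_key")


def should_hide_value(variable_name, keywords=ILLUSTRATIVE_KEYWORDS):
    """Return True if the variable name contains any sensitive keyword."""
    name = variable_name.lower()
    return any(keyword in name for keyword in keywords)


def display_value(variable_name, value):
    """Return a masked placeholder for sensitive names, else the value."""
    return "***" if should_hide_value(variable_name) else value
```

With this sketch, a Variable named `db_password` would be shown masked, while one named `bucket_name` would be displayed as-is. Note that the decision depends only on the name, never on the value itself.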
Here’s a list of keywords that will make a Variable qualify for having sensitive information store as…