Why Do This?
Like it or not, there are a lot of Windows shops out there. However, many super useful tools are Linux native and don’t really bridge into the Windows world (for good reason). That being said, Microsoft (I think) is acknowledging this gap and has provided the Windows Subsystem for Linux (WSL) as a bridge between these two dominant platforms. This allows us little people to take advantage of some of the high-powered tools available.
Special Thanks
This post stands on the back of a lot of really well-written articles, which I will link to. Additionally, the code here is good as of Feb 2020, so things are always subject to change.
Start with WSL
Assuming you are on a Windows box, you’ll need to enable the Windows Subsystem for Linux. This will allow you to run a lightweight Linux environment that also has access to your host machine’s file system if you so choose. The instructions are pretty decent at this link, but here are the high points.
Enable WSL
You need to enable the feature via PowerShell running in administrator mode. If you are on a server or VM, you can instead add the feature through the usual “additional features” dialogs.
Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
Get the Distro
Now you can download your distribution of choice (I like Ubuntu 18.04). There is some weirdness with some of the versions, so be careful what you choose. For instance, when I installed Ubuntu 16 LTS I could not upgrade to Python 3.6, which I needed, so I had to burn it down and install another.
Invoke-WebRequest -Uri https://aka.ms/wsl-ubuntu-1804 -OutFile Ubuntu.appx -UseBasicParsing
Add-AppxPackage .\Ubuntu.appx
Burn it Down? Huh?
So this is a little aside. Sometimes you mess up and the install doesn’t go as expected. It is surprisingly hard to remove an old distribution and restart (though some tutorials say that it is easy). Here is a trick, again in PowerShell:
wslconfig /l
This will allow you to see what is installed. If you want to reset your distribution, which I had to do, run the following, using the name exactly as it appears in that list:
wslconfig /u Ubuntu-18.04
This will unregister the distribution so that you can install it afresh; otherwise it will continue to live on in your system.
Now Start Ubuntu
Assuming your install went OK, now go to the start button (or whatever it is called now) and type “Ubuntu”. You should see the familiar logo. Select it and it will run. You’ll also have to pick a username (which can be anything you want) and a password.
As always, a good practice is to give your new Ubuntu instance an update and upgrade:
sudo apt update && sudo apt upgrade
Additionally, and this is important, you want to automount your Windows directories. You can typically access them via /mnt/c/Users..., but a lot of programs don’t like this and won’t run from a mounted drive. Because of this you will need to update a configuration file:¹
sudo nano /etc/wsl.conf
Then type the following into the configuration file:
[automount]
root = /
options = "metadata"
Now write out the file with CTRL + O, then exit with CTRL + X.
Close Ubuntu and open it again so the new settings are picked up. Now if you type the following, you should see your C drive (as c) listed among the directories:
ls /
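If you would rather generate the file than type it by hand, here is a minimal sketch of the same two settings. The function names are my own invention for illustration; only the [automount] section, root, and options keys come from the steps above. It simply renders the text and reads it back, so you can see what should end up in /etc/wsl.conf.

```python
import configparser
import io


def render_wsl_conf(root="/", options="metadata"):
    """Return wsl.conf text containing only an [automount] section."""
    parser = configparser.ConfigParser()
    # The options value is written with surrounding quotes, as in the post.
    parser["automount"] = {"root": root, "options": f'"{options}"'}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def read_automount(conf_text):
    """Parse wsl.conf text and return the [automount] settings as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    return dict(parser["automount"])


if __name__ == "__main__":
    text = render_wsl_conf()
    print(text)                  # paste this into /etc/wsl.conf
    print(read_automount(text))  # sanity check: parse it back
```

Writing to /etc/wsl.conf itself still needs sudo, so I would print the text and paste it into nano rather than write the file from Python.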
Postgres
Postgres is an open source database, which works great with Airflow and is relatively easy to use. Airflow will run out of the box on a local SQLite database, but in my experience it always seems to get corrupted and throw IO errors that are hard to debug.
Installing Postgres
In the Ubuntu terminal, type the following to install Postgres:²
sudo apt-get install postgresql
The user account for Postgres is postgres (easy enough to remember).
You’ll need to set a password for it with the following:
sudo passwd postgres
Then close Ubuntu and open it back up. Type the following to start the server:
sudo service postgresql start
Create a Postgres User and Database
First we’ll create a user for the postgres database. This is how Airflow will connect:
sudo -u postgres createuser airflow
sudo -u postgres createdb airflowdb
Now you want to jump into the postgres database that you started and provide rights and a password to airflow
sudo -u postgres psql
Now at the psql prompt (it looks like postgres=#), run:
alter user airflow with encrypted password 'securepassword';
grant all privileges on database airflowdb to airflow ;
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO airflow;
Now you can type \q (or press CTRL + D) to exit the postgres prompt. Congrats, you now have an airflowdb postgres database.
To verify that everything went as expected, again in the command line type:
sudo -u postgres psql -d airflowdb
Then in the airflowdb prompt type:
\conninfo
Note the port that it is serving on (typically 5432, but could be something different for you). Exit out back to the bash terminal.
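If you want a quick check from outside psql that something is actually listening on that port, a small sketch like the one below can help. It only tests that a TCP connection can be opened; it does not log in to Postgres. The function name and the default of port 5432 are assumptions for illustration, so swap in whatever \conninfo reported for you.

```python
import socket


def port_is_open(host="localhost", port=5432, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unknown hosts.
        return False


if __name__ == "__main__":
    print("Postgres port reachable:", port_is_open())
```

If this prints False, the first thing I would try is sudo service postgresql start.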
Config for Postgres
Now, thanks to this post we will update some configuration files so that Airflow and postgres can communicate.
First we need to change the pg_hba.conf file. Note that you may need to change your path depending on what version of postgres you have.
sudo nano /etc/postgresql/10/main/pg_hba.conf
Now that you have that file open, change the IPv4 connections line to the following:
host all all 0.0.0.0/0 trust
Write out those changes and then exit. Now restart the server with:
sudo service postgresql restart
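For anyone wondering what the five columns in that pg_hba.conf line mean, here is a small sketch that splits the line into named fields. The parser and its field names are my own for illustration and only handle the simple five-column form used above (not lines with a separate netmask or extra options).

```python
from collections import namedtuple

# Column names are descriptive labels chosen for this sketch.
HbaRule = namedtuple("HbaRule", "conn_type database user address method")


def parse_hba_line(line):
    """Split a simple five-column pg_hba.conf line into named fields."""
    fields = line.split("#", 1)[0].split()  # drop any trailing comment
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    return HbaRule(*fields)


rule = parse_hba_line("host all all 0.0.0.0/0 trust")
print(rule)
# conn_type "host": TCP/IP connections
# database "all", user "all": the rule applies to every database and role
# address "0.0.0.0/0": any IPv4 address
# method "trust": the connection is accepted without asking for a password
```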
Now one more configuration file change, this time at (again, adjust the version in the path to match your install):
sudo nano /etc/postgresql/10/main/postgresql.conf
Here change the listen_addresses field to look like that below:
listen_addresses = '*' # for Airflow connection
Then restart the server once more:
sudo service postgresql restart
Now Airflow
Wow, we went through a good bit just to get here. Worry not though, almost done.
Set AIRFLOW_HOME
First we need to make sure that Airflow’s home directory lives on a real drive on the local machine and not inside WSL. To ensure this happens, I like to add the path to my .bashrc file.
nano ~/.bashrc
Find an empty line and add the line below, replacing “user_name” with whatever username you have. The important things here are that you have write access to this directory and that you know the DAGs for Airflow will live, in this case, under the AirflowHome directory on your C drive.
export AIRFLOW_HOME=/c/Users/user_name/AirflowHome
Write out this change and then exit with the normal keystrokes. Then source your .bashrc file (source ~/.bashrc) or exit Ubuntu and reopen it.
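Since a missing or read-only AIRFLOW_HOME tends to show up later as confusing errors, a quick sanity check can save some time. This is just a sketch; the helper name is hypothetical, and it only checks that the directory exists and is writable by the current user.

```python
import os


def airflow_home_is_usable(path):
    """Return True if path is an existing directory we can write to."""
    return bool(path) and os.path.isdir(path) and os.access(path, os.W_OK)


if __name__ == "__main__":
    home = os.environ.get("AIRFLOW_HOME", "")
    print("AIRFLOW_HOME =", home or "(not set)")
    print("usable:", airflow_home_is_usable(home))
```

If the directory does not exist yet, create it first (for example with mkdir -p "$AIRFLOW_HOME") and run the check again.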
Python?
It is important to make sure that we have Python 3.6 or newer. This ships with Ubuntu 18.04, but it is worth checking with the following:
python3 --version
If we don’t have Python 3, we will need to install it with:
sudo apt-get install python3
If we do have it, we need to verify that we also have pip:
pip3 --version
If not, install it the same way, using the python3-pip package.
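The same two checks can be done from inside Python, which is handy if you have several interpreters installed and want to be sure which one you are in. A minimal sketch follows; the function names are made up for illustration, and the (3, 6) minimum is the requirement stated above.

```python
import importlib.util
import sys


def python_is_new_enough(version_info=sys.version_info, minimum=(3, 6)):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def has_pip():
    """Return True if the pip module is importable by this interpreter."""
    return importlib.util.find_spec("pip") is not None


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    print("new enough:", python_is_new_enough())
    print("pip available:", has_pip())
```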
Install Airflow (finally)
Now that all the pre-reqs have been accomplished we can install airflow with the following:
pip3 install "apache-airflow[postgres,mssql,celery,rabbitmq]"
This will install Airflow as well as all the packages that support connecting to Postgres and MS SQL and running multiple operators. Note the quotes and the lack of spaces inside the brackets; without them the shell will split the argument. Depending on what packages you already have, this might take a bit.
Configuring Airflow
Once everything is installed, we need to update the configuration for Airflow to point it to our preferred destinations.
We can do this with:
sudo nano ${AIRFLOW_HOME}/airflow.cfg
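(This also tests that our path was set successfully. If the file does not exist yet, running an Airflow command once, such as airflow version, should generate it.)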
Once there we need to alter two fields.
- Update our database to our fancy postgres instance as shown below:
sql_alchemy_conn = postgresql+psycopg2://airflow:securepassword@localhost:5432/airflowdb
- Change our executor to:
executor = LocalExecutor
In theory we should be able to use the CeleryExecutor, but I wasn’t successful with it and didn’t want to fight that anymore.
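The connection string is easy to get subtly wrong, so here is a sketch that assembles it from its parts. The helper name is hypothetical; the user, password, port, and database defaults are the ones created in the Postgres steps earlier. Special characters in a password need to be URL-encoded in a SQLAlchemy URL, which is why the pieces go through quote_plus.

```python
from urllib.parse import quote_plus


def build_sql_alchemy_conn(user, password, host="localhost", port=5432,
                           database="airflowdb"):
    """Assemble a postgresql+psycopg2 URL for the sql_alchemy_conn setting."""
    return (
        f"postgresql+psycopg2://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


if __name__ == "__main__":
    # Same user and password as in the "alter user" step above.
    print(build_sql_alchemy_conn("airflow", "securepassword"))
```

Use the port that \conninfo reported if yours is not 5432.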
You should scan through this configuration file to see if everything is correct and if you want to make any other changes. For example, you might want to change the timezone, as the author of this post did:
default_timezone = America/New_York
Serving it Up
Now you can write out your changes and exit the configuration file.
Before serving Airflow, initialize its database with the following:
airflow initdb
Now the database has been initialized. Next step is to start the scheduler:
airflow scheduler
This will start the scheduler (and it might take up the whole window with logs, so just open another instance of Ubuntu).
Then start the webserver:
airflow webserver
Again, this might take another terminal over completely, but don’t worry. If everything worked, you should be able to navigate to localhost:8080 and see your Airflow instance running in your favourite browser.
This will have the example DAGs loaded by default, so you should see several DAGs. Turn one of the tutorials on, then manually trigger the DAG to see how it works.
Running A Dag on the Local Machine
The whole purpose of Airflow is to schedule tasks. Ideally, you can use all of your Windows credentials (Active Directory, I’m looking at you) and manipulate files on shared drives. Here’s where it gets tricky: you will call programs on your Windows host machine to run the tasks in your DAG files.
An Example
I manage my DAGs through a Git repository. This provides me all the power of version control and lets other users submit PRs with new DAGs. If I need to make changes, I just push my new or changed DAG to the repository. Airflow then pulls this repository on a schedule, checks the files, and turns any shell scripts into executables.
# -*- coding: utf-8 -*-
from builtins import range
from datetime import timedelta

import airflow
from airflow.models import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.dummy_operator import DummyOperator

args = {
    'owner': 'michael',
    'start_date': airflow.utils.dates.days_ago(2),
}

dag = DAG(
    dag_id='pull_scheduler_updates',
    default_args=args,
    schedule_interval='2 4 * * *',
    dagrun_timeout=timedelta(minutes=20),
)

# Pull the latest DAGs using the Windows Git executable
update_repo = BashOperator(
    task_id='update_repo',
    bash_command='cd /c/Users/mike/AirflowHome; /c/Program\ Files/Git/cmd/git.exe checkout -f; /c/Program\ Files/Git/cmd/git.exe pull origin master; ',
    dag=dag,
)

# Convert Windows line endings to Unix line endings
convert_unix = BashOperator(
    task_id='convert_to_unix',
    bash_command='cd /c/Users/mike/AirflowHome/src; find . -type f -print0 | xargs -0 dos2unix; ',
    dag=dag,
)

# List the files with their permissions in octal
make_exec = BashOperator(
    task_id='make_exec',
    bash_command="cd /c/Users/mike/AirflowHome/src; ls -l | awk '{k=0;for(i=0;i<=8;i++)k+=((substr($1,i+2,1)~/[rwx]/) *2^(8-i));if(k)printf(\"%0o \",k);print}'; ",
    dag=dag,
)

update_repo >> convert_unix
convert_unix >> make_exec

if __name__ == "__main__":
    dag.cli()
The important part to note here is that I am changing directory to a place on my C drive, then calling Windows Git explicitly from my host system’s files. This allows my Windows machine to handle this process rather than counting on Ubuntu in WSL to execute these tasks.
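Because Windows paths like Program Files contain spaces, the bash_command strings are fiddly to write by hand. One option, sketched below, is to build them with shlex.quote so the quoting is handled for you. The helper names are hypothetical and this is not how the DAG above was written; it just produces an equivalent command string (with quoting instead of backslash escapes) that could be passed to a BashOperator.

```python
import shlex


def windows_command(executable, *args):
    """Quote an executable path and its arguments for use in bash."""
    return " ".join(shlex.quote(part) for part in (executable, *args))


def chain(*commands):
    """Join commands with '; ' and keep the trailing '; ' used in the DAG."""
    return "; ".join(commands) + "; "


GIT = "/c/Program Files/Git/cmd/git.exe"

bash_command = chain(
    "cd " + shlex.quote("/c/Users/mike/AirflowHome"),
    windows_command(GIT, "checkout", "-f"),
    windows_command(GIT, "pull", "origin", "master"),
)
print(bash_command)
```

The resulting string can be handed to BashOperator(bash_command=...) in place of the hand-written one.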
Similarly, if I had a shell script that looked like the following:
echo "starting"
echo $(pwd)
# Set Location
cd /c/Users/mike/cool_project
/c/Program\ Files/R/R-3.6.1/bin/x64/Rscript.exe really_long_script.R
echo "Done!"
This would allow me to call R directly on the host machine to run my long script. Ideally, I would set common system commands as aliases (e.g. Rscript pointing at /c/Program Files/R/R-3.6.1/bin/x64/Rscript.exe or something like this). I might go down that road, but not right now (or I will put it at the top of the bash script and have templates).
echo "starting"
echo $(pwd)
# Path to the Windows Rscript executable (quoted because of the space)
Rscript="/c/Program Files/R/R-3.6.1/bin/x64/Rscript.exe"
# Set Location
cd /c/Users/mike/cool_project
"$Rscript" really_long_script.R
echo "Done!"
Creating variables with the path names is what I have done in Makefiles (because I am old school and love GNU Make). In this way I can still navigate to a directory and issue the make command in the DAG.
Rmarkdown
One thing to consider when dealing with RMarkdown is that you need to explicitly define the path to pandoc when the script is being run by Airflow.
Sys.setenv(RSTUDIO_PANDOC="C:/Program Files/RStudio/bin/pandoc")
Thus, my scripts for R being run by Airflow will typically look like:
echo "starting"
echo $(pwd)
# Path to the Windows Rscript executable (quoted because of the space)
Rscript="/c/Program Files/R/R-3.6.1/bin/x64/Rscript.exe"
# Set Location
cd /c/Users/mike/cool_project
"$Rscript" -e 'Sys.setenv(RSTUDIO_PANDOC="C:/Program Files/RStudio/bin/pandoc"); rmarkdown::render("cool_report.Rmd")'
echo "Done!"
Issues…
Sometimes Airflow will die or you will submit a bad DAG and get things messed up. Don’t worry, there are plenty of tricks:
- Reset your database
airflow resetdb
This will clear out and reset your database. For extra reset power, go ahead and run airflow initdb afterwards.
You might need to start the server and the scheduler again as well.
- Kill orphan processes. I have found that if something fails while Airflow is running a process, it can get hung up and won’t restart correctly.³
When that happens you might need to kill the processes manually, then restart the Airflow database, scheduler, and webserver.
ps -aux | grep scheduler
This will list the processes. Scan them and see if anything needs to be killed, as shown below (3400 stands in for whatever process ID you found).
kill -9 3400
- Update DAGs. If you don’t automatically update the DAGs as I have done and are waiting for a new DAG to appear, you can try the following in the bash shell:
python3 -c "from airflow.models import DagBag; d = DagBag();"
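For the orphan-process hunt above, scanning ps output by eye gets old quickly. Here is a rough sketch that picks the process IDs out of ps aux style text. The function name is hypothetical and the sample output is made up purely to show the column layout; in practice you would feed it the real output of ps aux.

```python
def find_pids(ps_output, keyword):
    """Return PIDs from `ps aux` style text whose command mentions keyword."""
    pids = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 10)        # 11th field is the full command
        if len(fields) < 11:
            continue
        command = fields[10]
        # Ignore the grep process itself if the text came from a grep pipeline.
        if keyword in command and "grep" not in command:
            pids.append(int(fields[1]))      # second column is the PID
    return pids


# Made-up sample in the layout that `ps aux` prints.
SAMPLE_PS_OUTPUT = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
mike      3400  0.5  1.2 512000 98000 tty1     S    09:00   0:05 /usr/bin/python3 /usr/local/bin/airflow scheduler
mike      3410  0.0  0.0  14800  1100 tty1     S    09:01   0:00 grep scheduler
mike      3500  0.3  1.0 498000 91000 tty2     S    09:02   0:03 /usr/bin/python3 /usr/local/bin/airflow webserver
"""

if __name__ == "__main__":
    print(find_pids(SAMPLE_PS_OUTPUT, "scheduler"))
```

Each PID it returns is a candidate for kill -9, but I would still look at the command before killing anything.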
Conclusion
Airflow is great and I love it. You will have to learn how to use it with Windows, but the pain is worth it for the convenience and transparency of your jobs.
Footnotes
1. Learned this great tidbit here: https://www.astronomer.io/guides/airflow-wsl/
2. Special thanks to this blog post: https://medium.com/@harshityadav95/postgresql-in-windows-subsystem-for-linux-wsl-6dc751ac1ff3
3. See this post for an example: https://stackoverflow.com/questions/59080796/airflow-scheduler-periodically-complains-no-heartbeat
Citation
@online{dewitt2020,
author = {Michael DeWitt},
title = {Airflow on {Windows} {Linux} {Subsystem}},
date = {03/04/2020},
url = {https://michaeldewittjr.com/programming/2020-03-04-airflow-on-windows-linux-subsystem},
langid = {en}
}