DSF 2020: An Economist’s Perspective on Feature Engineering with Feature-engine
The 2020 Data Science Festival kicked off yesterday with a talk by Soledad Galli on feature engineering with Feature-engine, an open-source Python library for streamlining feature engineering pipelines. The library was created by Train In Data together with a number of external contributors, and boy does it look dreamy.
I come from a Stata/R background (shout out to all my Economist colleagues moving into Data Science!) and am still getting accustomed to Python, where feature engineering can feel laborious at times. Feature-engine seems to resolve this problem and more.
In her talk Soledad mentioned a number of challenges related to feature engineering:
- It is time-consuming and repetitive. Most code is written manually, and there’s no getting around that, since libraries like scikit-learn generally will not work without the right variable formats or appropriate treatment of missing values. Bits of code also tend to be written and rewritten, but not always in exactly the same way. Even with really strict version control, it can be difficult to guarantee consistent scripts, especially across a large team or multiple teams.
- Keeping your config and parameter files up to date is inefficient when everything is constructed manually. This ties in with the point above: if you are manually editing code as you iteratively improve your parameters or your model, you then need to make sure your configuration files are updated to match. This isn’t particularly efficient and may have implications for reproducibility…
- Reproducibility is king, and inconsistency makes it very difficult. As Soledad pointed out, the code you write in the research stage will likely look different from the code in your production pipeline; “more than one way to skin a cat” really rings true here. In the research stage you’re probably less concerned about speed or efficiency: you write code to give you the intended output as quickly as possible. In production, speed and efficiency become really important, so code is written to be as optimal as possible, and what worked in your research environment (usually a Jupyter notebook) might not translate so well to your production environment.
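To make these challenges concrete, here is a minimal sketch of the manual approach in plain pandas. It is my own illustration rather than anything shown in the talk, and the toy data, column names and the `learned_params` dictionary are all hypothetical. Notice that the fill values learned in the research stage have to be stored somewhere and the same steps retyped for new data, which is exactly where inconsistencies can creep in.

```python
import pandas as pd

# Hypothetical toy data; the column names are for illustration only.
train = pd.DataFrame({
    "income": [30.0, float("nan"), 50.0, 40.0],
    "region": ["N", "S", None, "N"],
})
new_data = pd.DataFrame({
    "income": [float("nan"), 70.0],
    "region": [None, "S"],
})

# Research stage: learn the fill values from the training data by hand.
learned_params = {
    "income_median": train["income"].median(),
    "region_mode": train["region"].mode().iloc[0],
}
train_clean = train.assign(
    income=train["income"].fillna(learned_params["income_median"]),
    region=train["region"].fillna(learned_params["region_mode"]),
)

# Production stage: the same steps are typed out again for new data,
# reusing the stored parameters rather than recomputing them.
new_clean = new_data.assign(
    income=new_data["income"].fillna(learned_params["income_median"]),
    region=new_data["region"].fillna(learned_params["region_mode"]),
)
```

Every extra variable or transformation adds another entry to keep in sync between the two stages, which is the maintenance burden described in the list above.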
Soledad said that using Feature-engine in your feature engineering process helps to resolve most of these issues. For one, it allows you to standardise your engineering procedures, so how you deal with, say, outliers or missing values can be consistent across projects and teams. It also significantly reduces the amount of scripting required of the Data Scientist/Analyst. Feature engineering in Python is such a shock to the system for those coming from Stata because we are used to black-box functions that are preconstructed and accepted as the correct procedure for performing a desired transformation. Feature-engine seems to bring Python closer to Stata, ultimately trading some control for standardised engineering procedures.
In addition, Feature-engine takes in dataframes, transforms them, and returns dataframes suitable for exploration, production or deployment. Combine this with the fact that Feature-engine is compatible with the scikit-learn pipeline, meaning you no longer need to rewrite research code to make it production-ready, and it would seem that Feature-engine could be a real game changer. In fact, Udemy’s Feature Engineering for Machine Learning course even has a few lectures on handling certain engineering procedures with Feature-engine!
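To give a feel for what "dataframe in, dataframe out, inside a scikit-learn pipeline" means, here is a rough sketch using only pandas and scikit-learn. I have deliberately not used Feature-engine itself here: `MedianImputerSketch` is a hypothetical hand-rolled stand-in of my own, and its `variables` argument and the toy data are assumptions for illustration, not the library's actual code.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


class MedianImputerSketch(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for a dataframe-in, dataframe-out imputer."""

    def __init__(self, variables):
        self.variables = variables

    def fit(self, X, y=None):
        # Learn one median per variable from the training data only.
        self.medians_ = {var: X[var].median() for var in self.variables}
        return self

    def transform(self, X):
        # fillna returns a new dataframe, so column names survive the step.
        return X.fillna(value=self.medians_)


# Hypothetical toy data.
X_train = pd.DataFrame({
    "income": [30.0, float("nan"), 50.0, 40.0],
    "age": [25.0, 35.0, float("nan"), 45.0],
})
y_train = pd.Series([1.0, 2.0, 3.0, 4.0])
X_new = pd.DataFrame({"income": [float("nan")], "age": [float("nan")]})

# The learned medians live inside the fitted pipeline, so the same
# object can be used in the notebook and in production.
pipe = Pipeline([
    ("imputer", MedianImputerSketch(variables=["income", "age"])),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

imputed_new = pipe.named_steps["imputer"].transform(X_new)
predictions = pipe.predict(X_new)
```

The design point is that the parameters are learned in `fit` and reapplied in `transform`, so nothing has to be copied into a separate config file by hand.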
As the session was on the shorter side, we unfortunately didn’t get a chance to see how feature engineering with Feature-engine compares to feature engineering without it. I was a little disappointed by this, so in my next blog I’ll be running through a side-by-side example of feature engineering with and without Feature-engine.
In the meantime, have a look at these helpful links on all things Feature-engine and Train In Data:
- GitHub repo for Feature-engine
- Soledad’s blog on Feature-engine
- Feature-engine documentation
- Train In Data’s thorough run-through of feature engineering with Feature-engine
I would encourage readers of this blog to have a look at the examples and links and let me know what you think of Feature-engine! To catch some of the talks at this year’s Data Science Festival, follow this link.