Frameworks and Libraries
What is a programming framework, and why use one?
In data science, a programming framework is software that has already been developed and that bundles reusable functionality, so you can build your projects faster and more easily. This is what makes frameworks so practical.
What are the top frameworks and libraries to use for data science projects?
Many frameworks are available to help data scientists turn a data science idea into a working project, and machine learning frameworks can automate parts of that process. Here are the ones you might want to consider first.
Tools for Data Cleaning and Manipulation
Pandas is a library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating tables and time series.
- Support for multiple file types
- Efficient handling of tabular data through the DataFrame structure
- Handling missing data
- Merging and joining datasets
- Group by
- Reshaping and pivoting of data sets.
- Slicing, indexing, and subsetting of datasets.
- Insertion and deletion of columns and rows in data structures.
- Time series functionality: date ranges, frequency conversion, moving-window statistics, …
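Several of the features listed above can be illustrated with a short sketch. The data, column names and variable names below (`sales`, `stores`, `amount`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records; one amount is deliberately missing.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "amount": [100.0, np.nan, 80.0, 120.0],
})
stores = pd.DataFrame({"store_id": [1, 2], "city": ["Paris", "Lyon"]})

# Handling missing data: replace the missing amount with 0.
sales["amount"] = sales["amount"].fillna(0.0)

# Merging and joining: attach the city of each store.
merged = sales.merge(stores, on="store_id", how="left")

# Group by: total amount per city.
totals = merged.groupby("city")["amount"].sum()

# Reshaping and pivoting: one row per city, one column per month.
pivot = merged.pivot_table(index="city", columns="month",
                           values="amount", aggfunc="sum")

# Time series: a daily date range and a 3-day moving average.
days = pd.date_range("2024-01-01", periods=5, freq="D")
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0], index=days)
moving_avg = series.rolling(window=3).mean()
```

Filling missing values with 0 is only one possible strategy; depending on the data, dropping the rows or interpolating may be more appropriate.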
NumPy is a library for fast numerical computation on arrays and matrices. It also provides a large set of high-level mathematical functions that operate on these arrays.
- Fast operations on N-dimensional arrays
- Variety of mathematical operations such as the Fourier transform
- Multidimensional containers
- Compact memory footprint
- Broadcasting (allows operations on arrays of different shapes).
- Tools for integrating code written in other languages such as C, C++ and Fortran.
- Image manipulation and processing
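As a minimal sketch of the points above, the following example shows broadcasting, a vectorized reduction and a Fourier transform. The array contents and variable names are made up for illustration.

```python
import numpy as np

# A 2x3 matrix of float64 values: [[0, 1, 2], [3, 4, 5]].
matrix = np.arange(6, dtype=np.float64).reshape(2, 3)
offsets = np.array([10.0, 20.0, 30.0])  # shape (3,)

# Broadcasting: the 1-D offsets are added to every row of the 2-D matrix.
shifted = matrix + offsets

# Vectorized math: the mean of each column, with no explicit Python loop.
column_means = matrix.mean(axis=0)

# Fourier transform: a 5 Hz sine wave sampled at 100 Hz for one second.
t = np.arange(100) / 100.0
signal = np.sin(2 * np.pi * 5 * t)
spectrum = np.abs(np.fft.rfft(signal))

# With a one-second window, the bin index equals the frequency in Hz.
peak_bin = int(np.argmax(spectrum))
```

Here `peak_bin` recovers the frequency of the sine wave, and `matrix.nbytes` reports how many bytes the array data occupies.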
PySpark is the Python API for Apache Spark. Apache Spark is a cluster computing framework built around speed, ease of use, and in-memory computing. It is very useful for processing, querying and analyzing large data sets.
- Near real-time computations, thanks to in-memory processing, which results in low latency.
- Fast parallel processing of large amounts of data on a single machine or a cluster of machines.
- Spark is accessible from several languages when it comes to processing huge data sets (Scala, Java, Python, …)
- Reduced complexity of data partitioning and task management, which are handled automatically thanks to the concept of resilient distributed datasets (RDDs).
Tools for Data Visualization
Seaborn is a Python data visualization library built on top of Matplotlib. It provides an excellent interface for drawing attractive and informative statistical graphs. Seaborn plays an important role in data exploration and analysis. The library is very useful for examining relationships between multiple variables.
- Large set of charts: histogram / scatter plot / box plot / violin plot …
- A dataset-oriented API allowing comparison between several variables.
- A wide range of color palettes that help reveal different types of patterns.
- Lets you create multi-plot grids, which is ideal for building complex visualizations.
- Automatic estimation and plotting of linear regression
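The automatic regression plotting mentioned above can be sketched as follows. The data is synthetic and the column names `hours` and `score` are hypothetical; the `Agg` backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: a noisy linear relationship between two variables.
rng = np.random.default_rng(0)
hours = rng.uniform(0, 10, size=50)
df = pd.DataFrame({
    "hours": hours,
    "score": 5 * hours + 40 + rng.normal(0, 5, size=50),
})

# regplot draws the scatter plot and fits and draws a regression line.
ax = sns.regplot(data=df, x="hours", y="score")
ax.set_title("Score vs. hours studied (synthetic data)")
```

The returned object is a regular Matplotlib `Axes`, so the figure can be customized further or saved with `ax.figure.savefig(...)`.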
Matplotlib is the most popular plotting library in the Python ecosystem, used for data exploration and visualization. It offers a very wide range of charts and customization options, and several other visualization libraries, such as Seaborn, are built on top of it.
- A wide range of charts: histogram / scatter plot / line plot / box plot / bar chart / violin plot …
- Provides an object-oriented API to integrate graphs in applications.
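A minimal sketch of the object-oriented API, assuming a figure with two side-by-side panels; the variable names are illustrative, and the `Agg` backend is used only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window is opened

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One Figure object holding two Axes objects.
fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Each Axes is configured through its own methods.
ax_line.plot(x, np.sin(x), label="sin(x)")
ax_line.set_xlabel("x")
ax_line.legend()

ax_hist.hist(np.sin(x), bins=10)
ax_hist.set_title("Histogram of sin(x)")

fig.tight_layout()
```

Working with explicit `Figure` and `Axes` objects, rather than the implicit `plt` state, is what makes it practical to embed plots in a larger application.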
Tools for Data Modelling
TensorFlow is a free, open-source software library for machine learning. It can be used for a variety of tasks, but focuses on the training and inference of deep neural networks. TensorFlow was developed by the Google Brain team, initially for internal use at Google.
- Parallel training of neural networks, making the models very efficient for large-scale systems.
- Easy training on CPU and GPU, including distributed computing.
- Flexibility and modularity
- Simplified visualization of each part of the computational graph.
- Dynamic models with Python control flow
- Models can be trained and deployed across several languages and platforms.
- Well documented, and therefore easy to learn
- Visualization kit with TensorBoard
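As a small sketch of how training works in TensorFlow, the following example minimizes a toy loss with `tf.GradientTape` inside an ordinary Python loop. The loss function, learning rate and number of steps are arbitrary choices made for illustration, not a recommended setup.

```python
import tensorflow as tf

# A single trainable parameter, starting at 0.
w = tf.Variable(0.0)
learning_rate = 0.1

# Plain Python control flow drives the training loop.
for _ in range(100):
    with tf.GradientTape() as tape:
        # Toy loss with its minimum at w = 3.
        loss = (w - 3.0) ** 2
    # Automatic differentiation: d(loss)/dw.
    grad = tape.gradient(loss, w)
    # One gradient descent step.
    w.assign_sub(learning_rate * grad)

final_w = float(w.numpy())
```

After the loop, `final_w` should be very close to 3. In practice one would normally use a built-in optimizer and a Keras model rather than a hand-written update.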
PyTorch is an open source machine learning library based on the Torch library, used for applications such as computer vision, natural language processing, and deep learning, primarily developed by Facebook’s AI Research Lab (FAIR).
PyTorch was designed from the start around dynamic computation graphs, which are built as the code runs and allow for greater flexibility in building complex architectures.
- Easy to use API
- Tight integration with Python and the Python data science stack.
- Computational graphs are created on the fly, when needed.
- Tensor computing with strong GPU acceleration
- Automatic differentiation for building and training neural networks.
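The dynamic graph and automatic differentiation described above can be sketched as follows. The function `f` is a made-up example whose structure depends on the input value at run time.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

def f(value):
    # Ordinary Python branching: the graph is built for the branch taken.
    if value > 1:
        return value ** 3
    return value ** 2

y = f(x)

# Automatic differentiation: fills x.grad with dy/dx.
y.backward()
grad = x.grad.item()  # derivative of x**3 at x = 2

# Tensor computing on the GPU when one is available, else on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
a = torch.ones(2, 3, device=device)
b = a @ a.T  # matrix product, shape (2, 2)
```

Because the graph is rebuilt on every call, a different input (for example `0.5`) would follow the other branch and produce a different graph.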
The SciPy library includes a number of modules for integration, linear algebra, optimization and statistics, built on the NumPy library seen earlier.
- Interaction with NumPy
- Indexing tricks
- Shape manipulation
- Function vectorization
- Special functions
- Linear algebra operations
- Optimization and curve fitting
- Statistics and random numbers
- Numerical integration
- Fast Fourier transforms
- Signal processing
- Image manipulation
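A few of the modules listed above can be sketched in one short example. The functions being integrated, minimized and fitted are toy choices made for illustration.

```python
import numpy as np
from scipy import integrate, optimize, stats

# Numerical integration: the integral of sin(x) from 0 to pi.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)

# Optimization: minimum of a one-dimensional quadratic.
result = optimize.minimize_scalar(lambda x: (x - 2.0) ** 2 + 1.0)

# Curve fitting: recover slope and intercept from points on a line.
def line(x, slope, intercept):
    return slope * x + intercept

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0
params, covariance = optimize.curve_fit(line, xs, ys)

# Statistics: cumulative distribution function of the standard normal at 0.
half = stats.norm.cdf(0.0)
```

All of these functions accept and return NumPy arrays or plain floats, which is what the interaction with NumPy amounts to in practice.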
Statsmodels is another popular library with similar statistical applications; it is widely used for statistical testing.
Scikit-Learn is considered one of the best Python libraries for working with complex data. It builds on the Matplotlib, NumPy and SciPy libraries, and provides a range of simple but effective tools for data mining and data analysis tasks.
It was initially developed by David Cournapeau as a Google Summer of Code project. Later, in 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort and Vincent Michel, from INRIA (French Institute for Research in Computer Science and Automation), took the project to another level and released the first public version.
- Ability to extract features from images and text
- Reusable in multiple contexts
- Several methods to verify the accuracy of supervised models on unobserved data
- Algorithms for supervised and unsupervised machine learning
- Consistent API and Python interface
- Built-in data sets
- Feature extraction
- Feature selection
- Parameter tuning
- Supervised models
- Unsupervised models
- Dimensionality reduction
- Ensemble methods
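The sketch below combines several of the features listed above: a built-in data set, dimensionality reduction, a supervised model, parameter tuning and evaluation on held-out data. The choice of model, the grid of values for `C` and the split ratio are arbitrary and only serve as an illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in data set, split into training and held-out test data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling, dimensionality reduction and a supervised model in one pipeline.
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

# Parameter tuning with 5-fold cross-validation on the training data.
search = GridSearchCV(
    pipeline, {"logisticregression__C": [0.1, 1.0, 10.0]}, cv=5
)
search.fit(X_train, y_train)

# Accuracy of the best model on data it has never seen.
test_accuracy = search.score(X_test, y_test)
```

Every estimator exposes the same `fit` / `predict` / `score` methods, which is why steps can be swapped in and out of the pipeline without changing the rest of the code.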
The Natural Language Toolkit (NLTK) is a suite of libraries and programs for symbolic and statistical natural language processing (NLP). It supports work in NLP and closely related areas, including empirical linguistics, cognitive science, information retrieval and machine learning. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania.
- N-gram and collocations
- Named-entity recognition
- Supports lexical analysis
- Frequency Distribution
- Lexicon Normalization
- Sentiment Analysis
- Text Classification
- Feature Generation using TF-IDF
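As a minimal sketch of n-grams and frequency distributions, the example below uses only parts of NLTK that do not require downloading additional corpora or models. The sample sentence is made up.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize
from nltk.util import ngrams

text = "the cat sat on the mat and the cat slept"

# Lexical analysis: split the text into tokens with a regex-based tokenizer.
tokens = wordpunct_tokenize(text.lower())

# Frequency distribution of single words.
freq = FreqDist(tokens)

# N-grams: all pairs of consecutive tokens (bigrams) and their counts.
bigrams = list(ngrams(tokens, 2))
bigram_freq = FreqDist(bigrams)

most_common_word = freq.most_common(1)[0][0]
```

Features such as named-entity recognition rely on extra resources that must be fetched separately with `nltk.download`.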
Keras is an open-source software library that provides a Python interface for artificial neural networks. Keras acts as an interface for the TensorFlow library. It was developed as part of the research effort of project ONEIROS (Open-ended Neuro-Electronic Intelligent Robot Operating System), and its primary author is François Chollet, a Google engineer.
- Focus on user experience.
- Multi-backend and multi-platform.
- Easy deployment of models to production
- Allows for easy and fast prototyping
- Convolutional networks support
- Recurrent networks support
- Keras is expressive, flexible, and apt for innovative research.
- Highly modular neural networks library written in Python
- Developed with a focus on enabling fast experimentation
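The fast prototyping mentioned above can be sketched with a small binary classifier. This example assumes Keras is used through TensorFlow (`from tensorflow import keras`); the data is random, and the layer sizes, number of epochs and batch size are arbitrary.

```python
import numpy as np
from tensorflow import keras

# Synthetic data: the label is 1 when the four features sum to more than 0.
rng = np.random.default_rng(0)
features = rng.normal(size=(64, 4)).astype("float32")
labels = (features.sum(axis=1) > 0).astype("float32")

# A model is declared as a stack of layers.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])

# Optimizer, loss and metrics are selected by name.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Training and prediction each take a single call.
history = model.fit(features, labels, epochs=5, batch_size=16, verbose=0)
predictions = model.predict(features[:3], verbose=0)
```

The whole workflow fits in a handful of lines, which is the main reason Keras is popular for quick experiments.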
The list above is not exhaustive; it is a subjective selection based on my own experience and on exchanges I have had with other data scientists. The goal of this article is to help data analysts and data scientists who are starting their careers become aware of the functions and libraries at their disposal to perform well on the job. Thanks to its wide range of libraries, Python is a very good fit as a data science tool.