If you’re interested in a career in data, and you’re familiar with the set of skills you’ll need to master, you know that Python and R are two of the most popular languages for data analysis. Both are simple (and free) to install and relatively easy to get started with. If you’re new to the world of data science and don’t have experience in either language, or with programming in general, it makes sense to be unsure whether to learn R or Python first.
R was developed to provide a better and more user-friendly way to do data analysis, statistics and graphical models. At first, R was primarily used in academia and research, but lately the enterprise world has been discovering R as well. This makes R one of the fastest-growing statistical languages in the corporate world.
One of the main strengths of R is its huge community that provides support through mailing lists, user-contributed documentation and a very active Stack Overflow group. There is also CRAN, a huge repository of curated R packages to which users can easily contribute. These packages are a collection of R functions and data that make it easy to immediately get access to the latest techniques and functionalities without needing to develop everything from scratch yourself.
When and how to use R?
R is mainly used when the data analysis task requires standalone computing or analysis on individual servers. It’s great for exploratory work, and it’s handy for almost any type of data analysis because of the huge number of packages and readily usable tests that often provide you with the necessary tools to get up and running quickly. R can even be part of a big data solution.
When getting started with R, a good first step is to install the RStudio IDE. Once this is done, we recommend you have a look at the following popular packages:
- dplyr, plyr and data.table to easily manipulate data,
- stringr to manipulate strings,
- zoo to work with regular and irregular time series,
- ggvis, lattice, and ggplot2 to visualize data, and
- caret for machine learning.
Why R is Great for Data Science
- Collecting Data
readr (Reimplements read.csv as something better)
- read.csv is problematic because it converts strings into factors by default (in R versions before 4.0), it’s slow, etc.
- Creates a contract for what the data features should be, making it more robust to use in production
- Much faster than read.csv
- Data Visualization
ggplot2 (ggplot2 was recently massively upgraded)
- Recently had a very significant upgrade (to the point where old code will break)
- You can do faceting and zoom into facets
htmlwidgets (Reusable components)
- Provides a fantastic gallery of widgets built by others that you can borrow from.
tilegramsR (Proportional maps)
- Create maps that are proportional to the population
- Makes it possible to create more interesting maps than those that only highlight major cities due to population density
- Cleaning & Transforming Data
dplyr (Swiss army chainsaw)
- The way R should’ve been in the first place
- Has a bunch of amazing joins
EzDataMunch has developed an R extension that helps generate statistical charts without any programming or coding. It is completely integrated with QlikView and Qlik Sense.
For more information on the EzDataMunch R extension, please visit: R Package Integration Solution
Python was developed to emphasize productivity and code readability. Programmers who want to delve into data analysis or apply statistical techniques are some of the main users of Python for statistical purposes. Like R, Python has packages as well. PyPI, the Python Package Index, is a repository of libraries to which users can contribute. Just like R, Python has a great community, but it is a bit more scattered, since it’s a general-purpose language. Nevertheless, data science is rapidly claiming a more dominant position in the Python universe: expectations are growing, and more innovative data science applications will originate here.
Why Python is Great for Data Science
- Collecting Data
Feather (Fast reading and writing of data to disk)
- Fast, lightweight, easy-to-use binary file format for storing data frames
- Makes pushing data frames in and out of memory as simple as possible
- Language agnostic (works across Python and R)
- High read and write performance
- Great for passing data from one language to another in your pipeline
Ibis (Pythonic way of accessing datasets)
- Bridges the gap between local Python environments and remote storage such as Hadoop or SQL databases
- Integrates with the rest of the Python ecosystem
ParaText (Fastest way to get fixed records and delimited data off disk and into RAM)
- Integrates with Pandas for CSV reading.
- Enables CSV reading at up to 2.5 GB per second.
- A bit difficult to install.
bcolz (Helps you deal with data that’s larger than your RAM)
- Compressed columnar storage
- You can define a Pandas-like data structure, compress it, and store it in memory
- Helps get around the performance bottleneck of querying from slower memory
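To make the idea of compressed columnar storage more concrete, here is a minimal sketch that uses only the Python standard library. It does not use the bcolz API; the `CompressedColumn` class and the sample columns are hypothetical names made up for illustration. Each column is stored as a separate compressed blob, and only the column a query needs is decompressed.

```python
import zlib
from array import array


class CompressedColumn:
    """One numeric column kept in memory as a zlib-compressed byte string.

    Hypothetical helper for illustration only; this is not bcolz.
    """

    def __init__(self, values, typecode="d"):
        self.typecode = typecode
        raw = array(typecode, values).tobytes()
        self.raw_size = len(raw)          # size before compression, in bytes
        self.blob = zlib.compress(raw)    # what is actually kept in memory

    def values(self):
        """Decompress the column back into a typed array."""
        out = array(self.typecode)
        out.frombytes(zlib.decompress(self.blob))
        return out


# A "table" is simply a dict of column name -> compressed column.
table = {
    "user_id": CompressedColumn(range(100_000), typecode="q"),
    "score": CompressedColumn([float(i % 10) for i in range(100_000)]),
}

# A query on one column only decompresses that column.
total_score = sum(table["score"].values())
compressed_size = len(table["score"].blob)
```

Because the values of a single column tend to resemble each other, they usually compress well, which is the main reason for storing data by column rather than by row.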
- Data Visualization
Altair (Like a Matplotlib 2.0 that’s much more user friendly)
- You can spend more time understanding your data and its meaning.
- Create beautiful and effective visualizations with a minimal amount of code.
- Takes a tidy Dataframe as the data source.
- Data is mapped to visual properties using the group-by operation of Pandas and SQL.
- Primarily for creating static plots.
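Since Altair expects a tidy DataFrame, it can help to see what "tidy" means in practice. The sketch below uses only pandas and assumes a small made-up sales table (the column names and numbers are hypothetical). It reshapes a wide table, with one column per year, into a tidy one, with one row per observation. The charting step itself is left out.

```python
import pandas as pd

# Wide format: one row per city, one column per year (made-up numbers).
wide = pd.DataFrame({
    "city": ["A", "B"],
    "2015": [10, 20],
    "2016": [12, 25],
})

# Tidy format: one row per (city, year) observation, one column per variable.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")

print(tidy)
```

In the tidy version, every variable (city, year, sales) has its own column, so each one can be mapped directly to a visual property such as an axis or a color.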
Bokeh (Reusable components for the web)
- Interactive visualization library that targets modern web browsers for presentation.
- Able to embed interactive visualizations.
- D3.js for Python, except better.
Geoplotlib (Interactive maps)
- Extremely clean and simple way to create maps.
- Cleaning & Transforming Data
Blaze (NumPy for big data)
- Translates a NumPy / Pandas-like syntax to data computing systems.
- The same Python code can query data across a variety of data storage systems.
- A good way to express your data transformations and manipulations.
xarray (Handles n-dimensional data)
- N-dimensional arrays of core pandas data structures (e.g. if the data has a time component as well).
Dask (Parallel computing)
- “Big Data” collections like parallel arrays, dataframes and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments.
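The general pattern behind such parallel collections is to split the work into chunks, process each chunk independently, and combine the partial results. The sketch below illustrates that pattern with the standard library only; it does not use the Dask API, and the function names are hypothetical. Threads are used here to keep the example simple; for CPU-bound work in CPython, a process pool or a library like Dask would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(n_items, chunk_size):
    """Yield (start, stop) index ranges that together cover range(n_items)."""
    for start in range(0, n_items, chunk_size):
        yield start, min(start + chunk_size, n_items)


def partial_sum_of_squares(bounds):
    """Process a single chunk without ever materializing the full dataset."""
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def parallel_sum_of_squares(n_items, chunk_size=250_000, workers=4):
    """Map the chunk function over all chunks, then reduce the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum_of_squares, chunked(n_items, chunk_size))
        return sum(partials)


result = parallel_sum_of_squares(1_000_000)
```

Only one chunk per worker needs to be held at a time, which is what makes the same approach usable for data that does not fit in memory.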
Choosing Between Python and R:
You can use Python when your data analysis tasks need to be integrated with web apps or if statistics code needs to be incorporated into a production database. Being a fully-fledged programming language, it’s a great tool to implement algorithms for production use.
While the immaturity of Python packages for data analysis was an issue in the past, this has improved significantly over the years. Make sure to install NumPy/SciPy (scientific computing) and pandas (data manipulation) to make Python usable for data analysis. Also have a look at Matplotlib to make graphics and scikit-learn for machine learning.
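As a rough illustration of how NumPy and pandas fit together, the following sketch generates some made-up data with NumPy, summarizes it per group with pandas, and then standardizes the values. The column names and the random data are assumptions chosen for the example; Matplotlib and scikit-learn are left out to keep it short.

```python
import numpy as np
import pandas as pd

# Made-up data: 100 measurements split evenly over two groups.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 50),
    "value": rng.normal(loc=10.0, scale=2.0, size=100),
})

# pandas: summary statistics per group.
summary = df.groupby("group")["value"].agg(["mean", "std", "count"])

# NumPy: standardize the values (subtract the mean, divide by the sample std).
values = df["value"].to_numpy()
z = (values - values.mean()) / values.std(ddof=1)

print(summary)
```

pandas handles the labeled, tabular side of the work (grouping and aggregating), while NumPy handles the array arithmetic underneath.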
R and Python: The Data Science Numbers
- In recent polls that focus on programming languages used for data analysis, Python is often a clear winner. If you focus specifically on the Python and R data analysis communities, a similar pattern appears.
- There are signals that more people are switching from R to Python. There is also a growing group of individuals using both languages together when appropriate.
- If you’re planning to start a career in data science, you can’t go wrong with either language. Job trends indicate an increasing demand for both skills.
- There are more data scientist job postings that ask for Python than for R, and their number has increased rapidly over the last few years.
As you can see, both languages are actively being developed and already have an impressive suite of tools. If you’re just starting out, one simple way to choose is to go with your comfort zone. For example, if you come from a computer science or developer background, you’ll probably feel more comfortable with Python. On the other hand, if you come from a statistics or analyst background, R will likely be more intuitive. We tend to prefer Python: it is a general-purpose programming language, which makes it possible to do pretty much anything you want to do.
Shubham works as a Python and Django developer at EzDataMunch. He is also involved in the development and enhancement of Qlik Sense and QlikView extensions.