Prediction Models: Data Science Theory

Predictive modelling is the process of using data mining (finding patterns in data) to estimate the probability of an outcome. In simple words, it is forecasting based on various methods.


It uses statistics to predict results. Predictions are usually about future outcomes, but the same models can be applied to any unknown data, even when the ordering in time doesn't matter.

Models are chosen on the basis of detection theory (the ability to distinguish informative patterns from random noise in the data).

The model is then selected through testing and validation, using detection theory to estimate the probability of an output for a given input.

Predictive Modeling Process:

The modelling process is iterative: several algorithms and models are tried on the same data, and the one that fits best is kept.

Categories of Models:
1. Predictive model
It analyses past results and uses the patterns found to predict future outcomes.
2. Descriptive Model
It describes the relationships between the various entities in the data.
3. Decision Model
It is used for making decisions under particular conditions, and it is a repeatable approach that can be applied again and again.

Exponential Smoothing:

It is a time-series technique in which older data gets less weight (lower priority) than newer data, since recent data is considered more relevant.
The smoothing constant, represented by α, controls the weight given to each observation.
It is used for short-term forecasting, and Tableau uses it as well.

Types of Exponential Smoothing:

  1. Simple exponential smoothing
  2. Double exponential smoothing
  3. Triple exponential smoothing

Simple exponential smoothing
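In simple exponential smoothing, each new smoothed value is a weighted average of the latest observation and the previous smoothed value: s(t) = α * x(t) + (1 - α) * s(t-1). A minimal sketch in plain Python; the demand numbers are made up just for illustration:

import numpy as np

def simple_exp_smoothing(series, alpha=0.3):
    # the first smoothed value is just the first observation
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

demand = [12, 15, 14, 18, 20, 19]
print(simple_exp_smoothing(demand, alpha=0.3))
# the last smoothed value serves as the short-term forecast for the next period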

Arima Model:


It stands for Auto-Regressive Integrated Moving Average.
The ARIMA model belongs to a class of statistical models for analyzing and forecasting time series.
There are two types of ARIMA models used for forecasting:
1. Seasonal
2. Non-seasonal
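A minimal sketch of fitting a non-seasonal ARIMA model, assuming the statsmodels library is installed; the series values are invented for illustration:

from statsmodels.tsa.arima.model import ARIMA

series = [112, 118, 132, 129, 121, 135, 148, 148, 136, 119]
model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): AR order, differencing, MA order
fit = model.fit()
print(fit.forecast(steps=3))             # forecast the next three values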

A friend of mine helped me write this article. Thanks to her.

Thank you for the read 🙂

Tableau: Business Intelligence Tool

Tableau is a Business Intelligence tool used for building beautiful visualizations and analyzing data. The tool is easy to operate once you get the hang of it and of all its features, including creating animations. Adding Tableau to your skills is a plus: many companies list experience with a BI tool among the required skills for Business Analyst, Data Analyst, or Data Scientist jobs.

Tableau is used to analyze data for businesses, create stories using visualizations, and help figure out trends. My favorite feature is its forecasting: it uses an exponential smoothing model to predict future values of whatever data you are working with, based on a weighted average of previous values. It is reasonably accurate, though obviously not perfect.

It lets you drag dimensions and measures onto a sheet to plot them.

It also lets you add, rename, save, and delete workbooks, and share your work using the online Tableau version.

I am using the Tableau Desktop version and also the online one. You can download it as a professional, or for free for a year as a student. Check the website for more: Tableau.

This is what Tableau looks like when opened: in the vertical blue bar, you can add data of whatever type you have to get started, or open an existing workbook.

You can also connect to a server listed there or reopen saved data sources. For other help, look in the "Discover" panel on the right.

After adding the data, it will open a window like this:

I'm adding a dataset about the world_bank, downloaded from Kaggle in CSV format.

Let's go to the worksheet shown in the screenshot to do some analysis and see how it works.

After adding Country Name (a dimension) to the Marks card, Tableau automatically picks a plot from the side bar to represent the chosen data.

We can add other dimensions and measures, and change the colors however we want. I also like how the side bar helps you choose a plot by telling you which dimensions and measures a graph requires.

I suggest you experiment with it to learn something new and see how efficiently it can analyse data and handle various interactive tasks.

Thank you for the read 🙂

Algorithms in Data Science: A must to know

Algorithms, as we know, are the basics of any computer science topic, and they play a big part in strengthening your foundation in the field. In Data Science, it is important to have an idea of the algorithms used for analysis, prediction, and various other tasks. There are many algorithms, but let's start with the following ones:

  1. Linear Regression
  2. Logistic Regression
  3. K-means clustering

Data science involves regression, classification, and clustering techniques, among others; the algorithms listed above are examples of each.

  1. Linear Regression:

Linear regression models a linear relationship between two or more variables, where we try to predict the dependent variable value (y) from the independent variable value (x). If we draw this in two-dimensional space, we get a straight line.

Linear regression outputs continuous values. In the two-dimensional case it uses two variables:

Independent variable: x

Dependent variable: y

The equation of the line formed in the graph is:

y = mx + c

where m is the slope of the line and c is the intercept. The fitted line is the straight line that best fits the points in the plot.
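A quick sketch of fitting such a line with numpy; the x and y values are invented for illustration:

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

m, c = np.polyfit(x, y, 1)   # degree-1 (straight line) fit returns slope and intercept
print(m, c)                  # roughly m ≈ 2 and c ≈ 0 for these points
print(m * 6 + c)             # predict y for a new x = 6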

2. Logistic Regression:

Logistic regression is a binary classification method. It uses the logistic function, also known as the sigmoid function, which statisticians developed to predict and describe the properties of a system. It is an S-shaped curve that takes any real-valued number and maps it to a value between 0 and 1 (the binary range), without ever reaching the limits (0 or 1) exactly.

1 / (1 + e^-x)

where e is the base of the natural logarithms (Euler's number) and x is the numeric value that you input to transform.

y = e^(b0 + b1*x) / (1 + e^(b0 + b1*x))

This is the equation for logistic regression, where y is the predicted output, b0 is the bias or intercept term, and b1 is the coefficient for the input value (x).
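A short sketch of this equation in Python; the values of b0 and b1 are assumptions just for illustration (in practice they are learned from data, for example with scikit-learn's LogisticRegression):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

b0, b1 = -1.0, 0.8                  # assumed intercept and coefficient
x = np.array([-2.0, 0.0, 2.0, 5.0])
y = sigmoid(b0 + b1 * x)            # outputs between 0 and 1
print(y)
print((y >= 0.5).astype(int))       # threshold at 0.5 to get the binary class labels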

3. K-Means Clustering:

It is an unsupervised learning algorithm. K-means clustering is an iterative algorithm that partitions a dataset of n points into k subgroups (clusters). It works by reducing the distance between each point and the centroid of its cluster.

The k-means clustering problem can be solved with either Lloyd's or Elkan's algorithm.

It minimizes a squared-error objective:

J = sum over j = 1..k of ( sum over i = 1..n of ||x_i(j) - c_j||^2 )

where J is the objective function, k is the number of clusters, n is the number of data points, c_j is the centroid of cluster j, x_i(j) is a data point assigned to cluster j, and ||x_i(j) - c_j|| is the Euclidean distance from that point to its centroid.
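A small sketch of k-means in practice, assuming scikit-learn is installed; the points are invented for illustration:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
print(km.labels_)            # cluster assignment for each point
print(km.cluster_centers_)   # centroid of each cluster
print(km.inertia_)           # value of the squared-error objective J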


Other algorithms to know:

Support Vector Machine

Principal Component Analysis

Artificial Neural Networks

Decision Trees

Thank you for the read 🙂

SciPy Operations: Linear Algebra and Polynomials

Scipy is a python library that extends numpy and works together with both numpy and matplotlib. It uses numpy arrays as its data structure and is used in scientific programming: integration, differentiation, Fourier transforms, and so on. Today we will try performing linear algebra and polynomial operations using scipy and numpy.

As you know the drill by now, we first need to install both libraries using pip (which comes bundled with Python 3).

pip install scipy

pip install numpy

To import, we bring in numpy and scipy. For linear algebra, we also need to import another subpackage that provides those operations: it is known as linalg.
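A minimal version of what those import cells likely contained:

import numpy as np
from scipy import linalg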

Next, we have to define the coefficient matrix and the right-hand-side values as arrays in order to solve these linear equations.

The equations are:

2x + 3y = 7

3x + 4y = 10
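One way to solve this system with linalg.solve; the coefficients come straight from the two equations above:

import numpy as np
from scipy import linalg

A = np.array([[2, 3],
              [3, 4]])       # coefficient matrix
b = np.array([7, 10])        # right-hand side

print(linalg.solve(A, b))    # -> [2. 1.], i.e. x = 2 and y = 1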

You can try other operations as well.

Matrix computations: 

To find the determinant of a matrix, use det(); to compute its inverse, use inv().
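A short sketch of both operations on the same matrix:

import numpy as np
from scipy import linalg

A = np.array([[2, 3],
              [3, 4]])

print(linalg.det(A))   # determinant: 2*4 - 3*3 = -1
print(linalg.inv(A))   # inverse: [[-4.  3.] [ 3. -2.]]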

Working with polynomials: 

To work with polynomials, we need to import poly1d from numpy.

We define the coefficients of a polynomial of the form ax^2 + bx + c.

So, in this snippet, the polynomial is x^2 + 2x + 3, and evaluating it at x = 7 gives 66.
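A sketch of what that snippet likely looked like:

from numpy import poly1d

p = poly1d([1, 2, 3])   # coefficients of x^2 + 2x + 3
print(p)                # pretty-prints the polynomial
print(p(7))             # evaluate at x = 7 -> 66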

We can also plot graphs with scipy, along with many other computations.

That’s it for today, I hope it was easy to understand. For more, you can contact me via the form on this website. I’d be happy to share some resources.

Thank you for the read 🙂 

 

Working with Matplotlib: Plotting various graphs.

Matplotlib is a python library used for making graphs and visualizations, often together with pandas and numpy for the datasets. It helps us create bar graphs, scatter plots, subplots, histograms, and more, and it lets us set axes properties, font styles, and line styles.

To download, use this command in your terminal:

pip install matplotlib

To import the library:
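The import most likely looked like this, using the usual pyplot alias:

import matplotlib.pyplot as plt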

Let’s start with the basic plotting of graphs.

The plot() method plots a list of numbers on a graph (a combined sketch follows the axis-label step below).

We can also pass two parameters (x and y values) and define the color. You can find the different color names on this site: matplotlib colors.

The x-axis and y-axis labels are set with xlabel() and ylabel():
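A small sketch covering these three steps; the numbers are invented for illustration:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, color='green')   # plot y against x in a chosen color
plt.xlabel('x values')          # label the x-axis
plt.ylabel('y values')          # label the y-axis
plt.show()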

Now, to draw bar graphs and scatter plots, we have to define the values to plot along with the figure size we want to use:

Bar graph: to plot a bar graph using the above information, we can simply use the bar() method.

Scatter plots: for scatter plots, use scatter().
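A sketch of both plot types with a chosen figure size; the values are invented for illustration:

import matplotlib.pyplot as plt

labels = ['A', 'B', 'C', 'D']
values = [10, 24, 17, 30]

plt.figure(figsize=(8, 4))      # set the figure size
plt.bar(labels, values)         # bar graph
plt.show()

plt.figure(figsize=(8, 4))
plt.scatter([1, 2, 3, 4], values)   # scatter plot
plt.show()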

We can also create subplots, but we have to be careful while making one: subplot() needs a valid numeric position (rows, columns, index) within range, otherwise it throws a ValueError. Each subplot is then drawn with the plot() method as usual.
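A short sketch of two subplots drawn with plot(); asking for a position outside the declared grid is what raises the ValueError mentioned above:

import matplotlib.pyplot as plt

plt.subplot(2, 1, 1)             # 2 rows, 1 column, first position
plt.plot([1, 2, 3], [1, 4, 9])

plt.subplot(2, 1, 2)             # second position in the same grid
plt.plot([1, 2, 3], [9, 4, 1])

plt.show()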

I hope it was easy to understand. I will try to add more technical use of these libraries later. Stay tuned.

Thank you for the read 🙂

 

Working with Pandas: Python Library for Data Analysis

Pandas is a python library used for manipulating and analyzing data. Its main data structure is the DataFrame. It can read files in CSV, XLSX, and TXT formats, and we can use it together with numpy and matplotlib to do even more. As mentioned above, it helps in data analysis and many related processes.

Today, we will learn how to create dataframes, index and slice them, and read and manipulate CSV files using pandas.

First, to import pandas, type:
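The import cell was probably just the usual alias:

import pandas as pd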

Creating a DataFrame: 

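A small sketch of creating a dataframe; the values are invented for illustration:

import pandas as pd

df = pd.DataFrame({'country': ['India', 'Brazil', 'Kenya'],
                   'population': [1380, 212, 53]})   # population in millions
print(df)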

Indexing of a dataframe: though the index starts from 0 by default, we can manipulate and alter it however we want. We can also index (select) a single column.
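A sketch of altering the row index and selecting a column (the dataframe is recreated here so the snippet runs on its own):

import pandas as pd

df = pd.DataFrame({'country': ['India', 'Brazil', 'Kenya'],
                   'population': [1380, 212, 53]})

df.index = ['a', 'b', 'c']   # replace the default 0-based index
print(df)
print(df['country'])         # index (select) a single column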

Slicing of a dataframe: we can slice the desired rows out of a dataframe the way we do slicing in python.
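A sketch of slicing rows, in the same spirit as python list slicing (df is the dataframe created above):

print(df[0:2])           # the first two rows, by position
print(df.loc['a':'b'])   # rows by index label, both ends included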

To read CSV files: I'm using a World Bank dataset that I used in my previous projects. You can also download datasets for free from the UCI Machine Learning Repository or Kaggle.

Command:

file = pandas.read_csv('filename.csv')


To get the number of columns

 filename.shape[1]

To get the number of rows:

filename.shape[0]

You can also get the column and row names, slice using them, and find the datatype used for a particular row or column.

The head() function gives you the first five rows of the data, and you can pass a number to head() to choose how many rows to display. The tail() function gives rows from the bottom of the data, and here too you can choose the number to display.
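A sketch that puts the CSV steps together; the file name is only a placeholder for whichever dataset you downloaded:

import pandas as pd

file = pd.read_csv('filename.csv')

print(file.shape[1])   # number of columns
print(file.shape[0])   # number of rows
print(file.head(10))   # first 10 rows (head() alone shows 5)
print(file.tail())     # last 5 rows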

That's it for today. I hope it helped you learn how to use pandas and take the first step in your learning. All the best.

 

Thank you for the read 🙂


Working with NumPy: Array Operations

As explained in the last post, Numpy is a widely used python library for performing high-level numeric computations on arrays. It has applications in machine learning and data analysis.

So today I'm going to perform some basic arithmetic operations on arrays, such as add, subtract, multiply, and divide, and also use the vstack, hstack, and dstack functions for joining and stacking arrays.

Addition can be done using the add() function or the + operator, as shown in the sketch below (which also covers subtraction); just make sure the arrays have the same dimensions, otherwise NumPy will raise an error about mismatched shapes.

Subtraction works the same way, with the subtract() function or the - operator:
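A minimal sketch of both operations; the arrays are invented for illustration:

import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

print(np.add(a, b))        # [11 22 33], same result as a + b
print(np.subtract(a, b))   # [ 9 18 27], same result as a - b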

For Multiply:

The multiply() function is used.

For Division:

The divide() function is used.
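A short sketch, reusing the same kind of arrays:

import numpy as np

a = np.array([10, 20, 30])
b = np.array([1, 2, 3])

print(np.multiply(a, b))   # [10 40 90], same result as a * b
print(np.divide(a, b))     # [10. 10. 10.], same result as a / b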

To get the remainder and mod of a division, we can simply use the remainder() and mod() functions:
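A sketch of both functions:

import numpy as np

a = np.array([10, 20, 30])
b = np.array([3, 7, 4])

print(np.remainder(a, b))   # [1 6 2]
print(np.mod(a, b))         # [1 6 2], mod() gives the same result here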

The power() function raises each element of an array to the given power:
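A one-line sketch of power():

import numpy as np

a = np.array([2, 3, 4])
print(np.power(a, 3))   # [ 8 27 64], each element raised to the third power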

Hence, these were the basic arithmetic operations to get started with the NumPy package of python. NumPy also supports joining and stacking arrays, which makes it even more useful.

JOINING

The concatenate() function joins a sequence of arrays along an existing axis. The append() function adds values to the end of an array, and it also lets you specify the axis:
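A sketch of joining with concatenate() and append(); the arrays are invented for illustration:

import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

print(np.concatenate((x, y), axis=0))   # join along rows, result has shape (4, 2)
print(np.append(x, y, axis=1))          # append along columns, result has shape (2, 4)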

STACKING

For stacking arrays, we use three functions:

vstack() for stacking arrays vertically, along rows.

hstack() for stacking arrays horizontally, along columns.

dstack() for stacking in depth, along a new third axis.
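A sketch of the three stacking functions on two small 1-D arrays:

import numpy as np

p = np.array([1, 2, 3])
q = np.array([4, 5, 6])

print(np.vstack((p, q)))   # shape (2, 3): arrays stacked as rows
print(np.hstack((p, q)))   # shape (6,): arrays joined end to end
print(np.dstack((p, q)))   # shape (1, 3, 2): stacked along a new third axis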

That's it for now; it was more code than explanation, but I hope its simplicity helps.

I might add my Jupyter notebook in pdf form to the resources in the upcoming posts.

Thank you for the read. 

 

Most Important Python Libraries: Data Science and ML

Hello, readers! I'm trying something new during this pandemic time. Unlike my last posts and my recent inactivity, things are going to be different this time.

I am trying to make this blog more technical by starting to write about the Data Science field, which I am currently working in with a year of experience.

I will start from the basics. I know there are tons of other free resources to learn all this from, but I will try to make it easier to understand based on my experience.

Today, I will explain a little about the most important libraries to know to get started with Data Science and ML. In upcoming posts, I might add articles on working with those libraries, as they hold the foundations of these two fields. I use Jupyter Notebook and PyCharm; they are very easy to download and install, and I will add some reference links below for newbies.

So let’s get started.

1. Numpy: Numerical python is a library widely used to perform numerical computation in python. It makes use of multidimensional objects known as arrays and provides tools to perform arithmetic operations on them.
You can make use of Numpy by downloading and installing it on your machine.

To download Numpy, you can simply type in your command line: 

pip install numpy (if you are using pip)

conda install numpy (if using conda)

To set up Numpy, I am using a Jupyter notebook and adding the code in a cell:
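The cell most likely contained just the import:

import numpy as np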

Some applications and features: 

Used in Data Analysis, ML

Creates n-dimensional (multidimensional) arrays

Faster computations with good efficiency

Github Repository 

2. Pandas: The Python data analysis library, pandas, is the most popular and widely used library in Data Science. It provides high-level tools and data structures, and it is used mostly for ETL, data wrangling, and data cleaning. It makes use of dataframes.

You can simply make use of Pandas by downloading and installing it in your machine.

To download pandas, you can simply type in your command line: 

pip install pandas (if you are using pip)

conda install pandas (if using conda)

To set up pandas in a notebook cell:
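That cell was probably just:

import pandas as pd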

Some applications and features: 

Highly used in data wrangling, cleaning, and ETL processes.

Supports loading of CSV file format

Easy syntax and operations to perform.

Github Repository

3. Matplotlib: The matplotlib library is used to plot different types of graphs such as bar graphs, histograms, and more, presenting data in a visual, presentable form. It provides an object-oriented API for embedding plots into applications and is used for data visualization.

You can simply make use of matplotlib by downloading and installing it in your machine.

To download matplotlib, you can simply type in your command line: 

pip install matplotlib (if you are using pip)

conda install matplotlib (if using conda)

To import matplotlib:
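Most likely the cell was simply:

import matplotlib.pyplot as plt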

Applications in creating charts and plotting graphs, making beautiful visualizations.

Easy to use, and operate.

Github Repository

4. Scipy: Scientific python is used to perform scientific computations and extends the functionality of numpy. It is also used for higher-level computations such as Fourier transforms and for optimization algorithms.

You can simply make use of scipy by downloading and installing it in your machine.

To download scipy, you can simply type in your command line: 

pip install scipy (if you are using pip)

conda install scipy (if using conda)

To import scipy: 

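The cell probably contained just:

import scipy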

Github Repository

 

That's it for today. I might add TensorFlow and scikit-learn later; till then, I hope this post gave you an overview of the libraries you need to know to get started, and I suggest you go deeper and check the workings of each library listed above in the upcoming posts here on my blog.

Links to download: 

Jupyter lab

Pycharm

Python latest version 3

Feedback is welcome and appreciated 🙂


Technical Events: Worth Attending!

WTM Event

Back in 2017, when I had just started my technical career and my interest in Computer Science, I attended this event from WTM Delhi, led by Chhavi P. Gupta. I was fortunate to spend a day with inspiring women in tech, where I shared ideas, interacted with women from different colleges in Delhi, talked with professionals on different topics, and witnessed a panel discussion. WTM's events are worth attending, as they involve a lot of activities, swag, pizza, and plenty of informative discussions on technology.


Panel Discussion


Group picture at the end of the event.

In my case, I first introduced myself, then talked to women about international research papers, products that have the potential to make a change, and how we need to work to make the gender ratio equal, and got to know about some STEM programs, among many other things. Last but not least, I got some cool WTM tees, had a lot of fun, and it was well worth attending.

FB Dev Circles Event

Facebook Developer Circles is one of the best-known technical groups in Delhi; it hosts technical events and workshops and shares a ton of opportunities with everyone. I first encountered this group in early 2017, and it has been growing continuously. The event I attended had several tracks lined up, on topics like bitcoin, tech trends, and the growth of the community, plus open discussion. Professionals were there to speak on these topics and share some insights on technology.

I asked questions related to the talks and shared my views on them. It was a lot of fun meeting new people and catching up with fellows I had met at other events. I shared my own knowledge with people, especially those who had just started their careers in tech. The food and the folks are the best parts of these technical events; they give you the chance to network, enhance your knowledge, and help you build your career.

I've also attended events by Google Developers Group New Delhi, Women Who Code, and others, which were really helpful.



Group picture after the FB Dev Circle’s event.

Learn IT, Girl! Program 3rd Edition 2017-2018

Hi, this blog post is about my journey as a participant of Learn IT, Girl! 3rd Edition Program from Oct 2017- Jan 2018.
Introduction: First of all, let me introduce the program. Learn IT, Girl! is an organization that supports women in tech; it runs a three-month coding program (Oct to Jan in my case), selecting girls from all over the world and teaching them a programming language. Applications open in September and close in October. For selection, you have to answer some questions, such as which programming language you want to learn, the project idea you want to develop, your motivation, and so on. After submitting the application, you will get a confirmation after some time if you are selected, and you will be assigned a mentor to guide you through the journey.
It began when I was surfing the internet and came across a Learn IT, Girl! post inviting applications to the program. I then submitted my application as a mentee (you can apply as a mentee or a mentor), answering every question that was asked.
After some time I got a mail from the organization saying I had been selected as one of the participants of the 3rd edition of their program and had been allotted a mentor to help me learn. I reached out to my mentor and gave him an idea of my knowledge and my project. We decided on the timeline and the material to study and work through. I started working in my first week and earned badges on the LITG platform by accomplishing the tasks my mentor gave me and solving problems on GitHub and other platforms, asking him my doubts to move ahead. As a full-time undergrad it was not easy to devote the time, but if you want to do something, you will do it even in the toughest situations.
Project: In these three months, I worked on a web-based application to track everyday activities like scheduling, fixing a meeting, or other plans, using Python and the Django framework. I successfully implemented Django in my project. The application had a few tabs, such as login/register, schedules, categories, and a feedback form, to make it efficient and user-friendly.


This program helped me stay involved in coding and taught me how to manage my time. It was truly a worthwhile experience. After three months of working on a project, every participant has to show the work done so the project can be evaluated; to get a certificate, you have to pass the evaluation of the final project. I gave them the link to the GitHub repository where I had added the project's code and sent them a presentation of the implemented project.
After the evaluation, I passed the program and was one of the developers to complete the 3rd edition. Soon I received the certificate of achievement. I am proud that I completed this edition and learned so many things; it kickstarted my technical career and my interest in the field of Computer Science and its applications.