GaabrielCoosta/Data__Analysis

📈 Data Analysis

  • By the end of this repository you will have learned data analysis with Python using the NumPy, Pandas and Matplotlib libraries

Captura de Tela 2023-03-10 às 12 04 30

“Big Data” is a huge amount of structured or unstructured data that floods a company on a daily basis and therefore cannot be processed by traditional data processing techniques. Interpreting this much data is challenging and requires a skilled professional who can process it and draw conclusions from it. Done well, the analysis can provide viable solutions to companies by uncovering significant trends and insights.

  • Big Data brings hardware and storage challenges, since all of that data has to be kept and processed somewhere
  • When we talk about data, information and knowledge, we seek to add value to the data so that both the user and the company benefit, intellectually as well as financially
  • Another challenge is data visualization. Turning tables into graphs lets us learn from the data in a more user-friendly way, make inferences more easily and therefore make smarter decisions. This helps us generate more value from the data.
  • The speed at which data and information are generated also matters in Big Data, so we need technologies and devices capable of handling and distributing data at that speed.

Captura de Tela 2023-03-10 às 12 06 01

  • Data: something incomplete
  • Information: gathering data brings meaning
  • Knowledge: Conclusions and reflections built through the information, the understanding of information
  • Wisdom: what is done with the knowledge gained

Data Professions

Captura de Tela 2023-03-11 às 06 02 53

𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 and 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬 are fields that join programming, mathematics, and business. Before looking at the difference between the two, you should understand both terms, starting with Data Science.

  • Data Science: an umbrella term for different models and methods used to get information from data. In simpler words, Data Science is a combination of various tools, machine learning principles, and algorithms with the aim of finding patterns in raw data.

  • Data Analytics: the process of examining data sets to draw conclusions about the information they contain, with the goal of increasing productivity and business gain. Information is extracted and classified to identify and analyze behavioral data, and the techniques used vary according to organizational requirements. It is also called data analysis.

Let’s understand the roles of Data Scientists and Data Analysts

𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞

  1. Requires knowledge of applied statistics, data mining, and computing algorithms such as neural networks and machine learning.
  2. Knowledge of database systems like MySQL, Hive etc. is required.
  3. Data Science is used in broader categories such as digital advertising or internet searches.
  4. Data Science plays a role in developing machine learning and AI.
  5. Data scientists formulate algorithms, which data analysts then develop further.

𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬

  1. Requires data fetching and querying skills.
  2. Data blending, data cleaning, data discovery, and data visualization are the major tasks in a data analyst’s job.
  3. Basic statistics knowledge is required.
  4. It fits industries such as travel, gaming, or healthcare well, where analysts can extract data to improve the business.

So, this was all about 𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 and 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬

Main Python Libraries for Data Analysis and Visualization

Captura de Tela 2023-03-11 às 06 29 23

NumPy is used a lot when we're working with large amounts of data, so you've certainly heard or will hear a lot about this library.

It is so useful because it shortens data processing time: when we have more data, what we want is exactly faster processing!

So it comes in handy when we are talking about modeling very large datasets, pandas, or artificial intelligence.

Whenever you have a large amount of data, you will see that this library is present.

P.S: If you are interested in entering this area, this library will be very useful for you. But even if you're not, it will still help you improve the processing time of your code, so it will come in handy one way or another!

How to use the Numpy Library in Python ?

If you want to work in data science, data analysis or data processing in Python and need a good tool for large amounts of data, you may have heard of the NumPy library. It is going to help you! Here, I will teach you how to use this library so that you can solve your problems.

  • The first step is to install the library. Just go to the Anaconda Prompt (if you have Jupyter installed) and type pip install numpy
  • Once this is done, we can start the code by importing the library with the import numpy as np command. Now we can start!

Before starting the code, it is important that you know what an array is: nothing more than a set of data that can be arranged in different dimensions

We have a few types of arrays in these different dimensions, and you may have heard of at least one of them

  • 1D array: has only one dimension. It is commonly called a vector;
  • 2D array: has 2 dimensions. It is commonly called a matrix;
  • 3D or higher array: has 3 or more dimensions. It is commonly called a tensor.

Now I'm going to show you how to create an array in Python

Creating an array

Captura de Tela 2023-03-11 às 11 07 33

Here we are going to use np.array to create a one-dimensional data set and assign this set to a variable
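Since the screenshot is not reproduced here, the following is only a minimal sketch of what creating arrays with np.array can look like. The values and variable names are made up for illustration.

```python
import numpy as np

# One-dimensional array (vector) built from a Python list
vector = np.array([1, 2, 3, 4, 5])

# Two-dimensional array (matrix): a list of equally sized rows
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(vector.ndim, matrix.ndim)  # 1 2
```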

P.S: Remember that we left all the documentation links in the file, so you can open them whenever you have a doubt or need to look something up

Since arrays filled with zeros or ones are often used in these processes, NumPy also has specific functions for creating them

np.zeros()

Captura de Tela 2023-03-11 às 11 13 59

In this case, for example, we are using np.zeros to create an array filled with zeros (you can use np.ones to create one filled with ones)

Only now we are also passing the shape, but what is that?

It's a way of telling NumPy the dimensions of our array. A shape such as (5, 3, 6) is like 5 matrices of 3 rows and 6 columns

To visualize this better, picture an object in 3 dimensions, with height, width and depth
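A minimal sketch of np.zeros and np.ones with a shape, assuming the shape (5, 3, 6) that matches the description of 5 matrices with 3 rows and 6 columns (the second shape is just an extra illustration):

```python
import numpy as np

# 5 "layers", each one a matrix of 3 rows and 6 columns, all filled with 0.0
zeros = np.zeros(shape=(5, 3, 6))
print(zeros.shape)     # (5, 3, 6)
print(zeros[0].shape)  # (3, 6) -> the first of the 5 matrices

# Same idea with ones: a single matrix of 2 rows and 3 columns
ones = np.ones(shape=(2, 3))
print(ones)
```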

Representation of Arrays in 3 dimensions

Captura de Tela 2023-03-11 às 11 17 23

This image shows what the 3 dimensions look like. You will see this a lot when working with images

A color image is made up of 3 matrices, one for each of the colors Red, Green and Blue, known as RGB

Creating a sequence of numbers

Captura de Tela 2023-03-11 às 11 22 11

With np.arange we can create sequences of numbers without wasting time writing the values out by hand

We can pass just the number of elements we want, and a sequence is built from 0 up to (but not including) that value (as in the first example, from 0 to 9, totaling 10 elements)

In the second example we ask for the numbers from 3 to 15 in steps of 2. Keep in mind that np.arange leaves the stop value itself out of the result

These are faster ways to build an array without writing everything manually, which would be horrible for very large datasets
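As a sketch of the two cases described above (the exact arguments in the screenshot are not visible here, so the stop values below are assumptions), note how np.arange treats the stop value:

```python
import numpy as np

# Only the stop value: counts from 0 up to, but not including, 10
first = np.arange(10)
print(first)   # [0 1 2 3 4 5 6 7 8 9]

# start, stop, step: the stop value itself is excluded
second = np.arange(3, 15, 2)
print(second)  # [ 3  5  7  9 11 13]

# To include 15, the stop has to be larger than 15
third = np.arange(3, 16, 2)
print(third)   # [ 3  5  7  9 11 13 15]
```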

Creating a linear sequence

Captura de Tela 2023-03-11 às 11 24 49

np.linspace allows you to create a linear sequence of equally spaced values.

In this case we want a sequence from 0 to 100 with 20 elements. The function works out the equal spacing itself and gives you all the elements of the set.

P.S: In this case we have endpoint=False, which means that the value 100 is not included in the set, so the spacing is 5. You can run a test to see how the set would look with the endpoint included!

Another very important option is retstep=True, which returns the number shown after our dataset. It is nothing more than the spacing between consecutive values.

In this case the spacing is 5 and you can easily see that, but in other cases, to avoid doing the math, just use this feature!
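Here is a small sketch of np.linspace with endpoint and retstep. It assumes 20 points between 0 and 100, which is what gives a spacing of 5 when the endpoint is left out; the second call is the "test" suggested above, with the endpoint included.

```python
import numpy as np

# 20 equally spaced values starting at 0, with 100 left out of the set
values, step = np.linspace(0, 100, num=20, endpoint=False, retstep=True)
print(values)  # 0, 5, 10, ... up to 95
print(step)    # 5.0

# Same call with the endpoint included (the default): 100 becomes the last value
values_with_end, step_with_end = np.linspace(0, 100, num=20, retstep=True)
print(step_with_end)  # 100 / 19, no longer a round number
```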

Finding the Size of an Array

Captura de Tela 2023-03-11 às 11 28 26

For this next step, we have some attributes to inspect an array besides print, because with a very large dataset printing everything may not be very useful

These are shape, size and ndim: the format (dimensions) of the dataset, how many elements it holds, and the number of dimensions it has
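A quick sketch of these three attributes on an example array (the array itself is an assumption, chosen only to have 3 dimensions):

```python
import numpy as np

data = np.zeros((5, 3, 6))

print(data.shape)  # (5, 3, 6) -> length of each dimension
print(data.size)   # 90        -> total number of elements (5 * 3 * 6)
print(data.ndim)   # 3         -> number of dimensions
```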

Another very useful operation on arrays is concatenation, which joins two arrays into one

Concatenating arrays

Captura de Tela 2023-03-11 às 11 32 39

Sometimes it's necessary to join the information to build another array depending on what you're working with, so it's important to know not only how to build them, but how to work with them inside Python
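A minimal sketch of joining arrays with np.concatenate, using made-up values. For 2D arrays the axis argument decides whether rows or columns are added.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Joining two 1D arrays end to end
joined = np.concatenate((a, b))
print(joined)  # [1 2 3 4 5 6]

# Joining 2D arrays along axis 0 adds rows (the column counts must match)
m1 = np.array([[1, 2], [3, 4]])
m2 = np.array([[5, 6]])
stacked = np.concatenate((m1, m2), axis=0)
print(stacked.shape)  # (3, 2)
```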

Querying items from an Array

Captura de Tela 2023-03-11 às 11 36 02

So of course we are going to cover some more useful tools so that you can work with these arrays smoothly

Just as with lists, we often need to query information within a dataset

In this case we are checking which values within our dataset are less than 8
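A sketch of this kind of query, assuming an example array with the numbers 1 to 12. The comparison produces an array of True/False values, which can then be used to keep only the matching elements.

```python
import numpy as np

data = np.arange(1, 13)  # 1, 2, ..., 12

# Element-by-element comparison: True where the value is less than 8
mask = data < 8
print(mask)

# Using the mask as an index keeps only the values where it is True
smaller_than_8 = data[mask]
print(smaller_than_8)  # [1 2 3 4 5 6 7]
```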

Operations with Arrays

Captura de Tela 2023-03-11 às 11 39 15

As with any data treatment or analysis, we need to perform some operations, and arrays are no different, so here are some of the main operations we can do with them
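The screenshot is not reproduced here, so the operations below are an assumed selection of common ones: element-wise arithmetic and a few summary functions.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.ones(4)

# Arithmetic is applied element by element
print(a + b)  # [2. 3. 4. 5.]
print(a * 2)  # [2 4 6 8]

# Summaries of the whole array
print(a.sum())   # 10
print(a.min())   # 1
print(a.max())   # 4
print(a.mean())  # 2.5
```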

Generating random samples

Captura de Tela 2023-03-11 às 11 42 12

Another very interesting point within data analysis is that sometimes we need random values, whether to test some code or even test a tool

So instead of creating a sequence or something that would be too “trivial” for the code or tool to use, we can create arrays with random values

For this we import the default_rng function from numpy.random and use rng.integers, which generates random integers

If you've never used random numbers like this, don't worry, it's quite common in data analysis
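A minimal sketch of generating random integers with default_rng. The seed, range and size are assumptions for illustration; passing a seed only makes the result repeatable between runs.

```python
from numpy.random import default_rng

# Create a random number generator (the seed is optional)
rng = default_rng(seed=42)

# 2 rows by 4 columns of integers from 0 up to, but not including, 10
samples = rng.integers(low=0, high=10, size=(2, 4))
print(samples)
```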

Difference between Arrays and Lists

Captura de Tela 2023-03-11 às 11 48 22

Finally, we will show you some differences between arrays and lists. Because they are very similar structures, but they have some differences

The first one is how Python represents them. Just below, type shows how Python classifies each of the two

So with these two checks you can already tell whether something is a list or an array (even without looking at the type)

Another very important point is that an array does not hold mixed data types, so if we put a text inside it, NumPy stores every element as a string instead of an integer

In lists this does not happen, each element will be classified according to its content, so one element can be classified as an integer while the other can be classified as a string
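The difference can be sketched like this, with made-up values. The list keeps each element's own type, while the array converts everything to one common type.

```python
import numpy as np

my_list = [1, 2, "three"]
my_array = np.array([1, 2, "three"])

print(type(my_list))   # <class 'list'>
print(type(my_array))  # <class 'numpy.ndarray'>

# Each list element keeps its own type
print([type(item).__name__ for item in my_list])  # ['int', 'int', 'str']

# The array stores every element as text: 1 became '1'
print(my_array)
print(my_array.dtype.kind)  # U (unicode string)
```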

Pandas

Pandas is an open-source, free-to-use (under a BSD license) Python library that provides tools for analyzing and manipulating data

Pandas allows you to work with different types of data, for example:

  • Tabular data, such as an Excel spreadsheet or SQL table
  • Time series data, ordered or not
  • Matrices
  • Any other set of data, which does not necessarily need to be labeled

The magic of reading, manipulating, aggregating and displaying data with just a few commands explains why the library has become so popular. By the way, all this is possible due to the primary structures of Pandas, the famous Series and DataFrames

Data Processing in Python

The first step in introducing Pandas is to import this library into Python.

The default import is simply import pandas, however we are going to use import pandas as pd

We do it this way because it makes code that uses this library easier to write: by default we would have to write pandas.(desired command). Both for an introduction to Pandas and for those who have been using it for some time, the alias is much more comfortable

So instead of always writing pandas, we can write pd.(desired command), shortening the code and making programming easier

Another important point is that pandas works with DataFrames, which are nothing more than tables inside Python

Now I'll show you how to create a DataFrame from a dictionary (in this case, a record of the sales that took place)

Creating a dataframe from a dictionary

Captura de Tela 2023-03-13 às 09 20 52

In this case pd.DataFrame() appears right at the beginning, but called with no arguments it only creates an empty DataFrame, which is not very common

That is why in the line below we create a sales dictionary, so we have some sales information such as: date, value, product and quantity

P.S: It is important to check the structure of this dictionary so that the data is stored correctly.

In the last line we create a variable to hold our DataFrame with the code pd.DataFrame(sales). So we are assigning a table to the variable sales_df

P.S: The df in the variable name is just a hint that makes it easier to know that this variable is a DataFrame. You could name it table_sales, for example!

Creating a DataFrame is very important for data visualization, because if you simply call print(sales) you will only see the dictionary on the screen
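A minimal sketch of these steps. The variable names sales and sales_df come from the text; the column names and values are made up for illustration.

```python
import pandas as pd

# Each key becomes a column name; each list holds that column's values.
# All lists need the same length so the rows line up.
sales = {
    "Date": ["2023-01-15", "2023-01-16", "2023-01-17"],
    "Value": [500, 300, 150],
    "Product": ["Shirt", "Shoes", "Cap"],
    "Quantity": [5, 2, 3],
}

# Build the table from the dictionary
sales_df = pd.DataFrame(sales)
print(sales_df)
```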

Data visualization

Captura de Tela 2023-03-13 às 09 32 34

Here we are going to check the difference between data visualization in Python with print and display

These two options show the same data, however print gives a plainer, notepad-like look (but still organized)

With display, we get something much more visual and easier to read, so when presenting a result it is also important to check that the information is shown in an easy-to-understand way

The other way to create a DataFrame is by importing files and databases.

IMPORTANT: For this example we are going to open an Excel file, and this file needs to be in the same folder as our code file

Importing files and database

Captura de Tela 2023-03-13 às 09 38 27

If you want to pull the file from another location, you have to put the full path of the file where we wrote its name

That takes a little more work but works just as well. Writing only the file name is simply more comfortable

See that the worksheet is shown as a DataFrame, already with the nicer look that display provides

IMPORTANT: This code may take a while to run on your computer, as the database has 90,000 rows, which is a considerable amount of data

Note also that only a few rows from the beginning and end of the table are shown, so that the user is not lost in all the data

Even so, the user can see the structure of the table and work with this information properly

Now let's see some simple and useful ways to summarize the data. What does that mean? That we have some methods that make it easier to inspect the table

Summary of Simple and Useful Data Visualizations

Captura de Tela 2023-03-13 às 09 49 46

Here we start with .head(), which lets the user choose how many rows to view from the top of the database

By default pandas shows only the first 5 rows, but in this example we ask for the first 10 rows

This method is important so that you can check whether the data is correct and the table structure is also correct.

In the second example we have the .shape attribute, which shows how many rows and how many columns this database has

Finally let's check the .describe method which is very useful and interesting. It will give you a summary of the numerical information that we have in our database.

Then you will have an overview of these items and a summary that makes certain analyses easier without any treatment of the table
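The three summaries can be sketched on a small stand-in table. The real example uses the imported Excel database; the 12-row table and its column names below are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the sales table
sales_df = pd.DataFrame({
    "Product": [f"Item {i}" for i in range(1, 13)],
    "Quantity": list(range(1, 13)),
    "Final Value": [10.0 * i for i in range(1, 13)],
})

# First 10 rows (head() with no argument returns 5)
first_rows = sales_df.head(10)
print(first_rows)

# shape is an attribute, not a method: (number of rows, number of columns)
n_rows, n_cols = sales_df.shape
print(n_rows, n_cols)  # 12 3

# Count, mean, std, min, quartiles and max of the numeric columns
summary = sales_df.describe()
print(summary)
```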

Now we are going to move on to the dataframe editing methods

IMPORTANT: Whenever you see pd.Series, it means that we have a pandas Series. What is that?

It is nothing more than a single column or a single row of your DataFrame. This matters because the next method we are going to use gets specific columns

And if you take just one column, you will see that even with display it does not appear formatted as a table

So when we write products = sales_df['Product'] we get a single unformatted column

For more than one column, we pass a list of column names inside the brackets (which gives double square brackets) and the result comes back as a formatted DataFrame
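A short sketch of the difference, on a made-up table. Single brackets with one name return a Series; a list of names returns a DataFrame.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Shirt", "Shoes", "Cap"],
    "Quantity": [5, 2, 3],
    "Final Value": [500, 300, 150],
})

# One column name -> pandas Series
products = sales_df["Product"]
print(type(products).__name__)  # Series

# A list of column names (double brackets) -> DataFrame
products_and_qty = sales_df[["Product", "Quantity"]]
print(type(products_and_qty).__name__)  # DataFrame
```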

Take 1 column

Captura de Tela 2023-03-13 às 09 58 48

This method is for getting just columns, but what if you want to get a row, or rows, or even a specific value?

For this we will use the .loc[] method to be able to make this part more specific

Get multiple rows and/or columns

Captura de Tela 2023-03-13 às 10 01 36

IMPORTANT: In the first example we are taking rows 1 to 5, and pandas goes by the index labels on the left, which it assigns itself. Remember that this index starts at zero, so you don't lose the first row

In the first example we are just taking rows 1 to 5 from our table

In the second example, we get all the rows in which the Store ID column is equal to Norte Shopping, that is, we limit our search to this store only

In the third example we repeat what we did in the second, but we also choose which columns to keep. This is useful when you don't need or don't want to show all the columns of the table
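The three .loc[] examples can be sketched as follows. The Store ID column and the Norte Shopping value come from the text; the other store name, the remaining columns and all values are made up.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store ID": ["Norte Shopping", "Store B", "Norte Shopping", "Store B",
                 "Store B", "Norte Shopping", "Store B"],
    "Product": ["Shirt", "Shoes", "Cap", "Shirt", "Cap", "Shoes", "Shirt"],
    "Quantity": [1, 2, 3, 4, 5, 6, 7],
})

# 1) Rows with index labels 1 to 5 (with .loc both ends are included)
rows_1_to_5 = sales_df.loc[1:5]

# 2) Every row where Store ID is Norte Shopping
is_norte = sales_df["Store ID"] == "Norte Shopping"
norte_df = sales_df.loc[is_norte]

# 3) Same filter, keeping only the chosen columns
norte_subset_df = sales_df.loc[is_norte, ["Store ID", "Quantity"]]

print(rows_1_to_5)
print(norte_df)
print(norte_subset_df)
```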

Add 1 column

Captura de Tela 2023-03-13 às 10 05 47

Now we are going to see how we can create or add a column inside our table

There are two ways: using an existing column to compose the new one, or assigning a default value to every row of that column

P.S: Remember that a : (colon) inside loc means that we want to select all rows or all columns (depending on where you put it)
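Both ways of adding a column can be sketched like this. The column names Commission and Tax, the 5% rate and the values are assumptions for illustration.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Shirt", "Shoes"],
    "Final Value": [100.0, 200.0],
})

# 1) New column computed from an existing one (hypothetical 5% commission)
sales_df["Commission"] = sales_df["Final Value"] * 0.05

# 2) New column with the same default value in every row;
#    the colon selects all rows, "Tax" names the column to create
sales_df.loc[:, "Tax"] = 0

print(sales_df)
```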

Now that we've learned how to insert columns, let's also learn how we can insert rows, that is, how we can insert new data into our dataframe

Captura de Tela 2023-03-13 às 10 09 25

In this case, we are importing another sales database, which contains all the December sales, into Python

Next, we join this data so that our database holds the data we already had plus the data for December

For this, the example uses the .append() method to add the rows of vendas_dez_df to sales_df. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, where pd.concat does the same job
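DataFrame.append was removed in pandas 2.0, so this sketch uses pd.concat instead, which works in older and newer versions alike. The names sales_df and vendas_dez_df come from the text; the result name full_sales_df, the columns and the values are made up.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Shirt", "Shoes"],
    "Quantity": [5, 2],
})

# Stand-in for the December sales table read from a file
vendas_dez_df = pd.DataFrame({
    "Product": ["Cap", "Shirt"],
    "Quantity": [3, 1],
})

# Stack the December rows under the existing ones;
# ignore_index=True renumbers the rows 0, 1, 2, ...
full_sales_df = pd.concat([sales_df, vendas_dez_df], ignore_index=True)
print(full_sales_df)
```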

Well, now that we've learned how to insert rows and columns, let's learn how to delete rows and columns

Delete rows and columns

Captura de Tela 2023-03-13 às 10 14 06

In this case, it's important to check the arguments of the .drop() method: the first argument is the row label or the column name

The second argument is the axis on which the action happens: axis=0 refers to rows and axis=1 refers to columns
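A minimal sketch of .drop() on a made-up table. Note that .drop() returns a new DataFrame and leaves the original unchanged unless you assign the result back.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Shirt", "Shoes", "Cap"],
    "Quantity": [5, 2, 3],
})

# axis=0 -> rows: remove the row whose index label is 0
without_first_row = sales_df.drop(0, axis=0)

# axis=1 -> columns: remove the column named "Quantity"
without_quantity = sales_df.drop("Quantity", axis=1)

print(without_first_row)
print(without_quantity)
```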

So far you have learned the basic pandas commands, but there are a few more commands that matter whenever you do data analysis or data processing

So let's take the opportunity to cover them in this extra part!

MORE - For Data Treatment and Analysis

The first commands that we are going to present are the commands to treat empty values, that is, those values that you saw in our table that were as NaN.

Empty values

Captura de Tela 2023-03-13 às 10 20 21

In the first example we have something similar to the method for deleting rows and columns, except that here we pass the argument how='all' to the .dropna() method

In this case we only drop columns that are completely empty, that is, columns with no information at all

The second example is for when we want to delete an entire row if at least one of its values is empty

In the third example we fill empty values with the average of the values we already have in that column. So we use the .fillna() method, which does the filling, along with .mean(), which computes the average

In the fourth example, we have another way to fill them: using the value just above

This is common in databases where repeated items, names or products are written only once, so to fill each gap with the value above we use the ffill() method
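The four cases can be sketched on a small made-up table with gaps. Column names and values are assumptions; np.nan is how an empty value (NaN) is written in code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": ["A", np.nan, "B", np.nan],
    "Value": [10.0, np.nan, 30.0, 20.0],
    "Empty": [np.nan, np.nan, np.nan, np.nan],
})

# 1) Drop columns (axis=1) that are completely empty
no_empty_cols = df.dropna(how="all", axis=1)

# 2) Drop every row that has at least one empty value
complete_rows = no_empty_cols.dropna()

# 3) Fill the gaps in one column with that column's average
filled_mean = no_empty_cols.copy()
filled_mean["Value"] = filled_mean["Value"].fillna(filled_mean["Value"].mean())

# 4) Fill each gap with the value just above it (forward fill)
filled_forward = no_empty_cols.ffill()

print(filled_forward)
```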

Now let's move on to a very interesting and widely used part of data analysis, which is the part on how to calculate the indicators

That is to say, what is the total amount, what is the revenue per store and so on

Calculation of indicators

Captura de Tela 2023-03-13 às 10 27 08

In the first example we have the method .value_counts() which serves to count the values that we have inside a column

In this case, we are counting the amount of transactions that were made per store, so we will have a summary of how many transactions each store made in an easy and fast way.

In the second example we use the .groupby() method, which groups the rows. Once the rows are grouped, we add up the values of each group with .sum()

In this case we only show two columns, because sometimes we don't want to show the whole table, so it's useful to leave some details out when necessary

Finally, we group by product, that is, we list all the products and add up the final value of each one, so we know the total value of each product in that store
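A sketch of these indicators on a made-up table. Store ID and Norte Shopping come from the text; the other names and values are assumptions.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store ID": ["Norte Shopping", "Norte Shopping", "Store B", "Store B", "Store B"],
    "Product": ["Shirt", "Shoes", "Shirt", "Cap", "Shirt"],
    "Final Value": [100, 200, 150, 50, 120],
})

# Number of transactions (rows) per store
transactions_per_store = sales_df["Store ID"].value_counts()
print(transactions_per_store)

# Revenue per store: keep two columns, group by store, add up each group
revenue_per_store = sales_df[["Store ID", "Final Value"]].groupby("Store ID").sum()
print(revenue_per_store)

# Same idea grouped by product
revenue_per_product = sales_df[["Product", "Final Value"]].groupby("Product").sum()
print(revenue_per_product)
```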

Now let's go to the last method that we are going to explain in this class, which is the method to merge 2 dataframes, that is, we will be able to look for information from one dataframe in the other

This means that we will be able to do a search between two different tables

IMPORTANT: It is necessary that these two tables have a column with information in common so that the search can be carried out

Merge 2 dataframes

Captura de Tela 2023-03-13 às 10 32 07

First we import the file, again using pd.read_excel(), then we merge using the .merge() method

As both tables already have a column with the same name, pandas uses it for the lookup and returns the matching information from the table we are merging

This means that we insert the information from the managers table into the database that we already have.
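A minimal sketch of the merge. Here the managers table is built by hand instead of being read from Excel, and its name managers_df, the Manager column and all values are made up. The shared Store ID column is what pandas uses to match the rows.

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store ID": ["Norte Shopping", "Store B", "Norte Shopping"],
    "Final Value": [100, 150, 200],
})

# Stand-in for the managers table (one row per store)
managers_df = pd.DataFrame({
    "Store ID": ["Norte Shopping", "Store B"],
    "Manager": ["Ana", "Bruno"],
})

# With no "on" argument, merge matches on the columns both tables share
merged_df = sales_df.merge(managers_df)
print(merged_df)
```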

Here we finish our introduction to Pandas, did you like everything you learned?

Matplotlib

Captura de Tela 2023-03-13 às 10 53 23

Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its numerical extension NumPy. As such, it offers a viable open source alternative to MATLAB. Developers can also use matplotlib’s APIs (Application Programming Interfaces) to embed plots in GUI applications.

A Python matplotlib script is structured so that a few lines of code are all that is required in most instances to generate a visual data plot. The matplotlib scripting layer overlays two APIs:

  • The pyplot API is a hierarchy of Python code objects topped by matplotlib.pyplot
  • An OO (Object-Oriented) API collection of objects that can be assembled with greater flexibility than pyplot. This API provides direct access to Matplotlib’s backend layers.

How to Create Graphs in Python

For this lesson on charts we will use the matplotlib library in Jupyter

If you installed Jupyter through Anaconda, matplotlib is already installed. If you want to install or update it, you can use the command pip install -U matplotlib

Now let's start the actual programming, the first step is to import this library so that we can use its resources

Importing the matplotlib library

Captura de Tela 2023-03-13 às 11 00 20

Let's use the pyplot submodule of this library. Another important point: whenever you see a library being imported followed by the as keyword, the alias is there to make your life easier!

In this case, whenever you use a command, you don't have to write matplotlib.pyplot, just plt

Now we are going to create the charts using the plot() function. Remember that the document we make available for download has some useful links to the documentation of these functions

https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html?highlight=pyplot%20plot#matplotlib-pyplot-plot

Creating a standard chart

Captura de Tela 2023-03-13 às 11 04 01

Here we are giving values to the x and y variables in order to create and display our first graph
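A minimal sketch of the import and a first chart. The x and y values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Values for the horizontal (x) and vertical (y) axes
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Draw a line through the (x, y) points
plt.plot(x, y)

# Jupyter renders the chart at the end of the cell;
# in a plain Python script, call plt.show() to open it
```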

Inside the available file we have an image that will be very useful when creating your charts, as it shows what you can format in a Python chart

Options that can be changed within the chart

Captura de Tela 2023-03-13 às 11 07 30

This image shows everything that can be changed within a chart. Everything circled or labeled in blue can be modified by the user.

P.S: It is worth remembering that within the file that is available for download, you will be able to click on all the links to view the documentation for each of the parts. This way you will be able to check the other options you have and even make inquiries when you have any questions or need additional information.

Entering information into the chart to make it more detailed

Captura de Tela 2023-03-13 às 11 11 00

Here we already have a new chart with some changes to make the chart more visual

We can also change the properties of a chart. In this one, for example, we can change the properties of the lines

Changing the graph's line style

Captura de Tela 2023-03-13 às 11 13 57

In addition to plotting the graph, we can change the style of the line (linestyle), as well as the color of the line (color) to facilitate visualization
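The extra chart details and the line properties can be sketched together like this. The texts, the dashed style and the green color are assumptions for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# linestyle and color change how the line is drawn;
# label is the text that shows up in the legend
plt.plot(x, y, linestyle="--", color="green", label="y = x squared")

# Title, axis names and legend make the chart easier to read
plt.title("Sample chart")
plt.xlabel("x values")
plt.ylabel("y values")
plt.legend()
```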

We can also use different types of graph, such as the dot graph (scatter) or even bar graph (bar)

scatter()

Captura de Tela 2023-03-13 às 11 17 39

bar()

Captura de Tela 2023-03-13 às 11 19 43

Dot chart and bar chart
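A minimal sketch of both chart types with made-up data. plt.figure() starts a new, empty figure so the two charts do not end up drawn on top of each other.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# Dot (scatter) chart: one point per (x, y) pair, no connecting line
plt.figure()
points = plt.scatter(x, y)

# Bar chart: one bar per category, with the height given by the value
plt.figure()
bars = plt.bar(["A", "B", "C"], [3, 7, 5])
```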

Here we change the type of marker in the chart, that is, the symbol used to represent each data point. In this case we use circles, so the result looks similar to a scatter chart, but it is not the same thing

Changing chart markers

Captura de Tela 2023-03-13 às 11 23 34

In this case we draw the data in red (r) and represent it with circles (o), hence the “ro” inside plot

Of course, you can change both the color and the type of marker. On the website linked in the file you will find a variety of markers that you can use in your chart

Which one to use depends on what the chart is for: there are several possibilities, so pick whatever is most appropriate for the situation
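A short sketch of the “ro” format string with made-up data. The first letter is the color and the second the marker; since no line style is given, only the markers are drawn.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# "r" = red, "o" = circle markers
plt.plot(x, y, "ro")

# Other combinations follow the same pattern, for example "bs" for blue squares
```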

Changing the boundaries (axes) of the chart

Captura de Tela 2023-03-13 às 11 26 45

We can also change the limits of the chart, to adjust the size according to your needs.

So inside axis we pass the minimum and maximum limits of the x and y axes, so the chart keeps the size you need instead of adjusting itself to whatever data you plot
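A minimal sketch of setting the limits with plt.axis. The data and the limit values are made up; the list is always in the order x minimum, x maximum, y minimum, y maximum.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, "ro")

# [xmin, xmax, ymin, ymax]
plt.axis([0, 6, 0, 20])
```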

Using the subplot (create more than one graph in the same visual)

Captura de Tela 2023-03-13 às 11 30 20

We can also create figures and subplots, that is, we can fit more than one chart in the same visual and show more than one result at the same time

This means that we can create an area to hold these charts and make them easier to view
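A sketch of a figure with two charts side by side. The figure size, the layout and the data are assumptions for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# The figure is the area that holds the charts (width and height in inches)
plt.figure(figsize=(10, 4))

# subplot(rows, columns, position): 1 row, 2 columns, first slot
plt.subplot(1, 2, 1)
plt.plot(x, y)
plt.title("Line")

# Second slot of the same grid
plt.subplot(1, 2, 2)
plt.bar(x, y)
plt.title("Bars")
```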

Finally, let's leave the last chart as a challenge: a practical example that uses the pandas library to analyze a data source from Kaggle!

The idea is to process the data and create a graph with the NY stock exchange rate

IMPORTANT: Don't worry, the available file already has the code to adjust the database and carry out this treatment

Let's practice ?

I will provide a solved exercise to practice

For this project, imagine that your boss makes a database available for you to analyze with your knowledge of Python. For this task, you must use Matplotlib to draw some charts, and Pandas and NumPy for data analysis and manipulation. With the correct use of these tools, you will be able to conduct data analysis and visualization properly, which is the basic work of a Data Scientist. To start your project, follow the instructions below:

 - Download the files '1-dadosgovbr---2014.csv' and 'Project.ipynb' and store them in the same folder where you will store your code files.
 - Load the .csv table so you can read data from it
 - Print part of the content to check that it is being read correctly
 - The solved project is available in the file: 'projectanswered.ipynb'

Note: In this first step, indicated by the instructions above, I already helped you, indicating the way to load the table, according to the code below :D #ThanksGod
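The original loading code is not reproduced here, so the following is only a sketch of the load-and-check steps. It runs on a tiny in-memory table with made-up columns; the commented line shows the call for the real file, whose columns, separator and encoding are not known here.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the real file (made-up columns and values)
sample_csv = io.StringIO("city,year,value\nA,2014,10\nB,2014,20\nC,2014,30\n")

# Load the table
table = pd.read_csv(sample_csv)

# Print part of the content to check that it was read correctly
print(table.head())
print(table.shape)  # (3, 3)

# With the real file in the same folder as the notebook, the call would be:
# table = pd.read_csv("1-dadosgovbr---2014.csv")
# Depending on how the file was exported, sep= or encoding= may be needed
```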
