- By the end of this repository you will have learned data analysis using Python and the NumPy, Pandas and Matplotlib libraries
“Big Data” refers to the huge amount of structured and unstructured data that floods a company on a daily basis and therefore cannot be processed with traditional data processing techniques. Interpreting these huge amounts of data is challenging, and it requires serious professionals who can process and interpret them. They can provide viable solutions to companies by uncovering significant trends and insights
- When we talk about Big Data, there are hardware and storage challenges involved in handling it
- When we talk about data, information and knowledge, we seek to add value to the data so that both the user and the company benefit, whether in intellectual or financial terms
- Another challenge is data visualization. It allows us to learn from data in a more user-friendly way through graphics: from tables we can generate charts, draw more accessible inferences, and therefore make smarter decisions. This helps us generate more value from data.
- Another interesting aspect of Big Data is velocity: data and information are generated very quickly, so we need to provide technologies and devices capable of handling this speed of data-related transformation.
- Data: raw facts, incomplete on their own
- Information: data gathered together in a way that brings meaning
- Knowledge: conclusions and reflections built from information; the understanding of information
- Wisdom: what is done with the knowledge gained
𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 and 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐭𝐢𝐜𝐬 are fields that combine programming, mathematics, and business. Before looking at the difference between the two, you should understand each term on its own, starting with Data Science
Data Science – It is a term for the different models and methods used to get information. In simpler terms, Data Science is a combination of various tools, machine learning principles, and algorithms whose aim is to find patterns in raw data.
Data Analytics – It is the process of examining data sets to draw conclusions about the information they contain, with the goal of increasing productivity and business gain. Information is extracted and classified to identify and analyze behavioral data, and different techniques are applied according to organizational requirements. It is also called data analysis.
Let’s understand the roles of Data Scientists and Data Analysts
𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞
- Requires knowledge of applied statistics, data mining, and computing algorithms such as neural networks and machine learning.
- Knowledge of database systems such as MySQL, Hive, etc. is required.
- Data Science is used in broader categories such as digital advertising or internet searches.
- Data Science plays a role in developing machine learning and AI.
- Data scientists then formulate algorithms, which data analysts go on to develop and apply.
𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬
- Requires data fetching and querying skills.
- Data blending, data cleaning, data discovery, and data visualization are the major tasks in a data analyst’s job.
- Basic statistics knowledge is required.
- Typical industries are travel, gaming, and healthcare, where analysts can extract data to improve the business.
So, this was all about 𝐃𝐚𝐭𝐚 𝐒𝐜𝐢𝐞𝐧𝐜𝐞 and 𝐃𝐚𝐭𝐚 𝐀𝐧𝐚𝐥𝐲𝐬𝐢𝐬
This library comes up a lot when we are talking about large amounts of data, so you have certainly heard, or will hear, a lot about it.
It is very useful precisely because it shortens data processing time: the more data we have, the more processing it takes, and NumPy handles that load efficiently!
So it will be very useful when we are talking about modeling very large datasets, Pandas, or artificial intelligence.
Wherever you have a large amount of data, you will see that this library is present.
P.S: If you are interested in entering this area, this library will be very useful for you. But even if you're not, it will still help you improve the processing time of your code, so it will come in handy one way or another!
If you want to work in the field of data science, data analysis, or data processing in Python and need a good tool for working with large amounts of data, you may have heard of the NumPy library! It is going to help you. Here, I will teach you how to use this library so that you can solve your problems
- The first step before starting is to install the library. To do that, just go to the Anaconda prompt (if you have Jupyter installed) and type pip install numpy
- Once this is done, we can import the library with the command import numpy as np. Now we can start!
Before starting the code it is important that you know what an array is: it is nothing more than a set of data that can be arranged in different dimensions
We have a few types of arrays in these different dimensions, and you may have heard of at least one of them
- 1D array – It has only one dimension. It is commonly called a vector;
- 2D array – It has 2 dimensions. It is commonly called a matrix;
- 3D or higher array – It has 3 or more dimensions. It is commonly called a tensor. Now I'm going to show you how to create an array in Python
Creating an array
Here we are going to use np.array to create a one-dimensional data set and assign this set to a variable
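The step above can be sketched as follows; the values are hypothetical, since the source doesn't show the actual data used:

```python
import numpy as np

# Create a one-dimensional data set from a Python list
# and assign it to a variable (values are illustrative)
grades = np.array([7, 8, 9, 10])

print(grades)        # the contents of the array
print(type(grades))  # <class 'numpy.ndarray'>
```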
P.S: Remember that all the documentation links are included in the file, so you can access them whenever you need to clear up doubts or look something up
As sets of values equal to 0 or equal to 1 are often used in these processes, we also have some specific functions for creating them
np.zeros()
In this case, for example, we are using np.zeros to create a set with zeros (you can use np.ones to create a set with 1)
But now we are also passing a shape. What is that?
It is a way of telling Python the dimensions of our array. So (5, 3, 6) means 5 matrices of 3 rows and 6 columns
So that you can better visualize this, it's as if you had an object in 3 dimensions, with height, width and depth
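A minimal sketch of np.zeros and np.ones with the shape described above (5 matrices of 3 rows and 6 columns):

```python
import numpy as np

# shape (5, 3, 6) = 5 matrices, each with 3 rows and 6 columns
zeros = np.zeros((5, 3, 6))
ones = np.ones((5, 3, 6))  # same idea, but filled with 1s

print(zeros.shape)  # (5, 3, 6)
```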
Representation of Arrays in 3 dimensions
This image is perfect for showing what 3 dimensions look like; when working with images you will see this a lot, since a color image is composed of 3 matrices, one for each of the Red, Green and Blue channels, the colors known as RGB
Creating a sequence of numbers
With np.arange we can create data sets in sequence without wasting time writing out all the values
We can pass only the number of elements we want, and a sequence will be made from 0 up to that value (as in the first example, from 0 to 9, totaling 10 elements)
In the second example, we have a sequence from 3 up to 15, increasing in steps of 2
So they are faster ways to build an array without having to write everything manually, which would be horrible for very large datasets
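The two examples described above can be sketched like this:

```python
import numpy as np

seq = np.arange(10)          # 0 to 9: 10 elements starting from 0
steps = np.arange(3, 15, 2)  # from 3 up to (not including) 15, in steps of 2

print(seq)    # [0 1 2 3 4 5 6 7 8 9]
print(steps)  # [ 3  5  7  9 11 13]
```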
Creating a linear sequence
np.linspace allows you to create a linear sequence of equally spaced values.
In this case we want a sequence from 0 to 100 with 20 elements. The code itself will work out this equal spacing and give you all the elements of this data set.
P.S: In this case we have endpoint = False, which means the value 100 will not be included; it does not need to appear within our set, so the linear spacing works out to 5. You can test to see what this set would look like if the endpoint were included!
Another very important point is retstep = True, which produces the number shown after our dataset. It is nothing more than the spacing between each of the numbers.
In this case it is 5, and you can see that easily, but in other cases, to avoid doing the math, just use this feature!
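Putting the pieces above together, a sketch of np.linspace with endpoint and retstep:

```python
import numpy as np

# 20 equally spaced values from 0 up to (not including) 100;
# retstep=True also returns the spacing between the values
values, step = np.linspace(0, 100, 20, endpoint=False, retstep=True)

print(values[:4])  # [ 0.  5. 10. 15.]
print(step)        # 5.0
```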
Finding the Size of an Array
For this next step, we have some commands to discover information about an array beyond print, because with a very large data set, print may not be very useful
These are shape, size and ndim: the shape of the dataset, how many elements it has, and the number of dimensions it has
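The three attributes above in a short sketch:

```python
import numpy as np

arr = np.zeros((3, 4))  # a 3-row, 4-column matrix

print(arr.shape)  # (3, 4) -> the format of the dataset
print(arr.size)   # 12     -> total number of elements
print(arr.ndim)   # 2      -> number of dimensions
```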
Another very interesting topic for arrays is concatenation, since we can join two arrays into one
Concatenating arrays
Sometimes it's necessary to join the information to build another array depending on what you're working with, so it's important to know not only how to build them, but how to work with them inside Python
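A minimal sketch of joining two arrays with np.concatenate:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Join the two arrays into a single one
joined = np.concatenate((a, b))
print(joined)  # [1 2 3 4 5 6]
```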
Querying items from an Array
So of course we are going to cover some more useful tools so that you can work with these arrays smoothly
One thing we need, just as with lists, is to query information within a dataset
In this case we are checking which values within our dataset are less than 8
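The "values less than 8" query can be sketched with a boolean mask; the data itself is hypothetical:

```python
import numpy as np

data = np.array([3, 10, 7, 15, 2, 8])

# Comparing an array to a value produces a boolean array (a mask)
mask = data < 8
print(mask)        # [ True False  True False  True False]

# Indexing with the mask keeps only the values where it is True
print(data[mask])  # [3 7 2]
```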
Operations with Arrays
As with any treatment or data analysis it is necessary to do some operations and with arrays it would not be different, so we have some of the main operations that we can do with arrays
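A few of those operations sketched out; NumPy applies them element by element:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

print(a + b)     # [11 22 33] -> element-wise sum
print(a * b)     # [10 40 90] -> element-wise product
print(b.mean())  # 20.0       -> aggregate operation over the array
```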
Generating random samples
Another very interesting point within data analysis is that sometimes we need random values, whether to test some code or even test a tool
So instead of creating a sequence or something that would be too “trivial” for the code or tool to use, we can create arrays with random values
But for this case we have to import default_rng from numpy.random and use rng.integers, which generates random integers
If you've never used random numbers like this, don't worry, it's quite common in data analysis
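A sketch of random-integer generation with default_rng; the seed is an assumption added here so the result is reproducible:

```python
import numpy as np
from numpy.random import default_rng

# A fixed seed makes the "random" draw repeatable between runs
rng = default_rng(seed=42)

# 5 random integers from 0 up to (not including) 10
sample = rng.integers(low=0, high=10, size=5)
print(sample)
```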
Difference between Arrays and Lists
Finally, we will show you some differences between arrays and lists. Because they are very similar structures, but they have some differences
The first one is how Python represents them: just below we have the type() output showing how Python classifies the two
So in these two ways you can already identify whether it is a list or an array (even if you don't have the classification)
Another very important point is that an array does not allow mixed data types, so if we put a text element inside the data set, Python starts to classify all the data inside it as strings instead of integers
In lists this does not happen, each element will be classified according to its content, so one element can be classified as an integer while the other can be classified as a string
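The difference described above in a short sketch:

```python
import numpy as np

as_list = [1, 2, "three"]
as_array = np.array([1, 2, "three"])

print(type(as_list))       # <class 'list'>
print(type(as_array))      # <class 'numpy.ndarray'>

# The list keeps each element's own type...
print(type(as_list[0]))    # <class 'int'>

# ...while the array converts everything to one type (strings here)
print(as_array.dtype.kind)  # 'U' = unicode string
```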
Pandas is an open-source, free-to-use (under a BSD license) Python library that provides tools for analyzing and manipulating data
Pandas allows you to work with different types of data, for example:
- Tabular data, such as an Excel spreadsheet or SQL table
- Ordered and unordered time series data
- Matrices
- Any other set of data, which does not necessarily need to be labeled
The magic of reading, manipulating, aggregating and displaying data with just a few commands explains why the library has become so popular. By the way, all this is possible due to the primary structures of Pandas, the famous Series and DataFrames
Data Processing in Python
The first step in introducing Pandas is to import this library into Python.
The default call is simply import pandas; however, we are going to use import pandas as pd
We do it this way because it makes writing code with this library easier: by default we would have to write pandas.(desired command). For an introduction to Pandas, and even for those who have been using it for some time, it is much more comfortable this way
So instead of always writing pandas, we can write pd.(desired command), reducing the amount of typing and making programming easier
Another important point is to understand that pandas works with DataFrames, which are nothing more than tables inside Python
Now I'll show you how to create a dataframe from a dictionary (which would be a description of what happened)
Creating a dataframe from a dictionary
In this case we could use pd.DataFrame() right at the beginning, but that creates an empty dataframe, which is not very usual
That is why in the line below we are creating a sales dictionary, so we have some sales information such as: date, value, product and quantity
P.S: It is important to check the structure of this dictionary so that the data is stored correctly.
In the last line we create a variable to hold our dataframe with the code pd.DataFrame(sales). So we are assigning a table to the variable sales_df
P.S: This df that was placed in the variable is just an indication to make it easier for us to know that this variable is a dataframe. You can put table_sales for example!
Creating a dataframe is very important for data visualization, because if the user simply puts a print(sales) he will only have the dictionary shown on the screen
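The steps above can be sketched as follows; the sales dictionary is hypothetical, since the course's actual data isn't shown here:

```python
import pandas as pd

# Hypothetical sales dictionary: date, value, product and quantity
sales = {
    "Date": ["2022-12-01", "2022-12-02"],
    "Product": ["Notebook", "Mouse"],
    "Quantity": [2, 5],
    "Value": [3500.0, 150.0],
}

# Assign the table to the variable sales_df
sales_df = pd.DataFrame(sales)
print(sales_df)
```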
Data visualization
Here we are going to check the difference between data visualization in Python with print and display
These two options will give you the same result, however with the print, we have a more notepad look (but still organized)
Now with the use of the display, we have something much more visual and easier to visualize the data, so when we are going to show some result it is also important to verify that the information is being shown in an easy-to-understand way
The other method of creating a dataframe is by importing files and databases
IMPORTANT: For this example we are going to open a file in Excel, however this file needs to be in the same place where we have our code file
Importing files and database
In case you want to pull the file from another location, you will have to put the full path of the file where we wrote its name
It takes a bit more work, but it works smoothly; it is simply more comfortable and easier to write just the file name
See that we have the worksheet being shown normally in the dataframe format already with the most suitable look with the use of the display
IMPORTANT: It is good to point out that this code, when executed, may take time on your computer, as it is a database with 90,000 lines, so in fact it has a considerable amount of data
Another point that is good to take into account is that Python only showed a few lines from the beginning and end of the table so as not to have to show all the data and leave the user lost
But this way it is possible for the user to see the structure of the table so that he can work with this information properly
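Since the course's spreadsheet isn't bundled here, the sketch below writes a tiny Excel file first and then reads it back with pd.read_excel, which is the same loading pattern; with the real file you would just pass its name. Reading/writing .xlsx assumes the openpyxl package is installed:

```python
import pandas as pd

# Stand-in for the course's spreadsheet (requires openpyxl for .xlsx)
pd.DataFrame({"Product": ["Mouse"], "Value": [150.0]}).to_excel(
    "sales_demo.xlsx", index=False
)

# The actual import step: read the Excel file into a dataframe
sales_df = pd.read_excel("sales_demo.xlsx")
print(sales_df)
```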
Now let's look at the summaries for simple and useful data visualization. What does that mean? That we have some methods that make visualization easier
Summary of Simple and Useful Data Visualizations
Here we are initially using .head(), which lets the user choose how many rows to view from the database
By default it shows only the first 5 rows, but in this example we show the first 10 rows of the database
This method is important so that you can check whether the data is correct and the table structure is also correct.
In the second example we have the .shape method, which will show us how many rows and how many columns this database has
Finally let's check the .describe method which is very useful and interesting. It will give you a summary of the numerical information that we have in our database.
Then you will have an overview of these items and a summary to be able to facilitate certain analyzes without having to do any treatment in the table
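The three methods above in a sketch, using a small hypothetical dataframe:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store": ["A", "B", "A", "C"],
    "Value": [100.0, 250.0, 80.0, 40.0],
})

print(sales_df.head(2))     # first 2 rows (the default is 5)
print(sales_df.shape)       # (rows, columns) -> (4, 2)
print(sales_df.describe())  # count, mean, std, min, quartiles and max
                            # of the numeric columns
```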
Now we are going to move on to the dataframe editing methods
IMPORTANT: It is very important to point out that whenever we have pd.Series it means we have a pandas Series. What is that?
It is nothing more than a single column or a single row of your dataframe. It is important to say this, because the next method we are going to use is to get specific columns
And if you take just one column, you will see that even with the display method, it will not appear all formatted and beautiful
So when we use it this way: products = sales_df['Product'] we will only have a single unformatted column
Now for more columns we can put another 2 square brackets and bring it normally, already formatted
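The single-bracket versus double-bracket difference, sketched with hypothetical columns:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Mouse", "Keyboard"],
    "Value": [150.0, 300.0],
})

products = sales_df["Product"]       # single brackets -> a pandas Series
products_df = sales_df[["Product"]]  # double brackets -> a (formatted) DataFrame

print(type(products))
print(type(products_df))
```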
Take 1 column
This method is for getting just columns, but what if you want to get a row, or rows, or even a specific value?
For this we will use the .loc[] method to be able to make this part more specific
Get multiple rows and/or columns
IMPORTANT: In the first method we are taking rows 1 to 5; however, pandas will use the index numbers on the left, which it assigns itself. So it's very important to remember that the index starts at zero, so you don't lose the first piece of information
In the first example we are just taking rows 1 to 5 from our table
In the second example, we are getting all the information, in which the Store ID column is equal to Norte Shopping, that is, we are limiting our search to this information only
In the third example we are going to repeat what we did in the second, but we are going to choose the columns that we are going to store with this data, this is important when you don't need or don't want to show all the columns of the table
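The three .loc examples above in a sketch; the table and the "Norte Shopping" store are hypothetical stand-ins for the course's data:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store ID": ["Norte Shopping", "Center Mall", "Norte Shopping"],
    "Product": ["Mouse", "Keyboard", "Monitor"],
    "Value": [150.0, 300.0, 900.0],
})

# 1) A range of rows by index label (inclusive on both ends with .loc)
rows = sales_df.loc[1:2]

# 2) Only the rows where Store ID equals "Norte Shopping"
norte = sales_df.loc[sales_df["Store ID"] == "Norte Shopping"]

# 3) Same filter, but keeping only the chosen columns
norte_cols = sales_df.loc[
    sales_df["Store ID"] == "Norte Shopping", ["Product", "Value"]
]
print(norte_cols)
```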
Add 1 column
Now we are going to see how we can create or add a column inside our table
There are two ways, the first is using an existing column to compose the new one, or assigning a default value to all the information in that column
P.S: Remembering that when we use the : (colon) inside the loc it means that we are wanting to select all rows or columns (depending on where you put it)
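Both ways of adding a column, sketched; the "Commission" rate is a made-up example:

```python
import pandas as pd

sales_df = pd.DataFrame({"Value": [100.0, 200.0]})

# 1) New column derived from an existing one (hypothetical 5% commission)
sales_df["Commission"] = sales_df["Value"] * 0.05

# 2) New column with a default value for every row;
#    the ":" inside loc selects all rows
sales_df.loc[:, "Store"] = "Norte Shopping"

print(sales_df)
```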
Now that we've learned how to insert columns, let's also learn how we can insert rows, that is, how we can insert new data into our dataframe
In this case, we are importing the sales database that contains all December sales into Python again
Next, we are going to join this data so that our database is complete with the data we have, plus the data for December
For this, we will use the .append() method to indicate that we want to add the rows of vendas_dez_df to sales_df (note that in pandas 2.0 and later, .append() was removed in favor of pd.concat())
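A sketch of joining the two bases with tiny stand-in dataframes; since DataFrame.append() was removed in pandas 2.0, the sketch uses pd.concat, which does the same job:

```python
import pandas as pd

sales_df = pd.DataFrame({"Product": ["Mouse"], "Value": [150.0]})
vendas_dez_df = pd.DataFrame({"Product": ["Monitor"], "Value": [900.0]})

# Stack the December rows under the existing base;
# ignore_index=True renumbers the index from zero
sales_df = pd.concat([sales_df, vendas_dez_df], ignore_index=True)
print(sales_df)
```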
Well, now that we've learned how to insert rows and columns, let's learn how to delete rows and columns
Delete rows and columns
In this case, it's important to check the arguments of the .drop() method, because in the first argument we'll need the line number or column name
And in the second argument we have to have the axis that this action will happen, so if the axis is equal to 0 we will be on the axis of the rows, if the axis is equal to 1 we will be on the axis of the columns
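A sketch of .drop() with both axis values, on a hypothetical table:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Product": ["Mouse", "Keyboard"],
    "Value": [150.0, 300.0],
})

no_first_row = sales_df.drop(0, axis=0)        # axis=0 -> drop a row by index
no_value_col = sales_df.drop("Value", axis=1)  # axis=1 -> drop a column by name

print(no_value_col)
```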
So far you have learned the basic pandas commands, however whenever you are going to do data analysis or data processing we have some important commands
So let's take advantage of this and pass it on to this extra part!
MORE - For Data Treatment and Analysis
The first commands that we are going to present are the commands to treat empty values, that is, those values that you saw in our table that were as NaN.
Empty values
In the first example we have something similar to the method for deleting rows and columns, except that in this case we pass the how argument equal to 'all' to use the .dropna() method correctly
In this case we will only exclude columns that are completely empty, that is, a column that has no information
In the second example we will use it when we want to delete an entire line if at least one of the values is empty
In the third example we will fill in empty values with the average of the values we already have in that column. So let's use the .fillna() method which is for filling, along with the .mean() which is actually the average
In the fourth example, we have another way to fill it, which is using the value that is just above it
This is often used when we have a database where we don't want to repeat items, names, products, and so on, so each is entered only once; to always fill with the value above, we use the .ffill() method
Now let's move on to a very interesting and widely used part of data analysis, which is the part on how to calculate the indicators
That is to say, what is the total amount, what is the revenue per store and so on
Calculation of indicators
In the first example we have the method .value_counts() which serves to count the values that we have inside a column
In this case, we are counting the amount of transactions that were made per store, so we will have a summary of how many transactions each store made in an easy and fast way.
In the second example we use the .groupby() method, which means group by. Then we aggregate the grouped information with .sum()
In this case we are only showing two columns, because sometimes we don't want to show the whole table, so it's important to hide some details when necessary
Finally, we are grouping by products, that is, we will have all the products and we will add the final value of each one of them, in this way we will know the total value of each one of the products in that store
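Both indicator calculations sketched on a small hypothetical sales table:

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store": ["Norte", "Center", "Norte"],
    "Product": ["Mouse", "Mouse", "Monitor"],
    "Value": [150.0, 150.0, 900.0],
})

# 1) Count how many transactions each store made
print(sales_df["Store"].value_counts())

# 2) Group by product and sum the value of each one
revenue = sales_df.groupby("Product")["Value"].sum()
print(revenue)
```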
Now let's go to the last method that we are going to explain in this class, which is the method to merge 2 dataframes, that is, we will be able to look for information from one dataframe in the other
This means that we will be able to do a search between two different tables
IMPORTANT: It is necessary that these two tables have a column with information in common so that the search can be carried out
Merge 2 dataframes
First we will import the file, again using the .read_excel() method, then we will be able to merge using the .merge() method
As we already have a column with the same name, pandas will already do this search and will return the information from the table that we are going to merge
This means that we are going to insert the information from the managers table into the database that we already have.
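The merge step sketched with two stand-in tables that share a "Store" column (the names are hypothetical; the course uses a managers spreadsheet loaded with .read_excel()):

```python
import pandas as pd

sales_df = pd.DataFrame({
    "Store": ["Norte", "Center"],
    "Value": [150.0, 900.0],
})
managers_df = pd.DataFrame({
    "Store": ["Norte", "Center"],
    "Manager": ["Ana", "Bruno"],
})

# pandas matches the rows on the column the two tables share ("Store")
merged = sales_df.merge(managers_df)
print(merged)
```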
Here we finish our introduction to Pandas, did you like everything you learned?
Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its numerical extension NumPy. As such, it offers a viable open source alternative to MATLAB. Developers can also use matplotlib’s APIs (Application Programming Interfaces) to embed plots in GUI applications.
A Python matplotlib script is structured so that a few lines of code are all that is required in most instances to generate a visual data plot. The matplotlib scripting layer overlays two APIs:
- The pyplot API is a hierarchy of Python code objects topped by matplotlib.pyplot
- An OO (Object-Oriented) API collection of objects that can be assembled with greater flexibility than pyplot. This API provides direct access to Matplotlib’s backend layers.
How to Create Graphs in Python
For this lesson on graphics we will use the matplotlib library and we will use Jupyter
In Jupyter this matplotlib library is already installed, however, if you want to update or check if the library is up to date, you can use the command pip install -U matplotlib
Now let's start the actual programming, the first step is to import this library so that we can use its resources
Importing the matplotlib library
We will use the pyplot submodule of this library. Another important point: whenever you see a library being imported followed by the as keyword, it is to make your life easier!
In this case, whenever you use a command, you don't have to write matplotlib.pyplot and just write plt
Now we are going to create the graphs, we are going to use the plot() function, remembering that within the document that we make available for download there are some useful links to the documentation of these structures
Creating a standard chart
Here we are giving values to the x and y variables in order to create and display our first graph
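A minimal sketch of that first chart; the x and y values are hypothetical, and the Agg backend line is an addition so the script also runs without a display (in Jupyter it isn't needed):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

# Hypothetical values for the x and y axes
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

line, = plt.plot(x, y)  # create the graph
plt.show()              # display it
```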
Inside the available file we have an image that will be very useful when creating your graphics, as it will help you format graphs in Python
Options that can be changed within the chart
This image shows everything that can be changed within a chart. So everything circled in blue text can be modified by the user.
P.S: It is worth remembering that within the file that is available for download, you will be able to click on all the links to view the documentation for each of the parts. This way you will be able to check the other options you have and even make inquiries when you have any questions or need additional information.
Entering information into the chart to make it more detailed
Here we already have a new chart with some changes to make the chart more visual
We can change the properties of the graphs; in this case, for example, we can change the properties of the lines
Changing the graph's line style
In addition to plotting the graph, we can change the style of the line (linestyle), as well as the color of the line (color) to facilitate visualization
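A sketch of changing linestyle and color; the dashed green styling is an arbitrary choice for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# linestyle changes the line's style, color changes its color
line, = plt.plot(x, y, linestyle="--", color="green")
plt.show()
```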
We can also use different types of graph, such as the dot graph (scatter) or even bar graph (bar)
scatter()
bar()
Dot chart and bar chart
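Both chart types sketched side by side with hypothetical data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 15, 25]

fig, (ax1, ax2) = plt.subplots(1, 2)
ax1.scatter(x, y)  # dot (scatter) chart
ax2.bar(x, y)      # bar chart
plt.show()
```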
Here we can modify the type of marker in the graph, which is the way it will be used to represent the data, in this case we use circles, so it is similar to the scatter graph, but it is not the same
Changing chart markers
In this case we are putting it in red (r) and we are representing the data with circles (o) that's why the “ro” inside the plot
Of course, you can change both the color and the type of marker; on the website linked in the file you will see a variety of markers you can use to complement your graphic
So it depends on the need you have for creating the graph, as there will be several possibilities, so you can use whatever is most appropriate for the situation
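The "ro" format string sketched out:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

# "r" = red, "o" = circle markers (the data is drawn as dots, no line)
line, = plt.plot(x, y, "ro")
plt.show()
```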
Changing the boundaries (axes) of the chart
We can also change the limits of the chart, to adjust the size according to your needs.
So inside axis we can put the minimum and maximum limits of the x and y axes, so we will have a size according to your needs and not a variable size according to every piece of information you put
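A sketch of fixing the axis limits; the limits chosen are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [2, 4, 6])

# [xmin, xmax, ymin, ymax]: minimum and maximum limits of the x and y axes
plt.axis([0, 10, 0, 20])
ax = plt.gca()  # keep a handle on the current axes
plt.show()
```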
Using the subplot (create more than one graph in the same visual)
We can also create figures and subplots, that is, we can adjust more than one graph in the same visual, so we can show more than one result at the same time
This means that we can create an area to create these graphics and facilitate their visualization
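A sketch of putting two graphs in the same visual with plt.subplots; the data and figure size are arbitrary:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in Jupyter
import matplotlib.pyplot as plt

# One figure containing two charts side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot([1, 2, 3], [1, 2, 3])
axes[1].plot([1, 2, 3], [3, 2, 1])
plt.show()
```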
Finally, let's leave the last graph as a challenge, which is a practical example, which will use the pandas library to analyze the Kaggle data source!
The idea is to process the data and create a graph with the NY stock exchange rate
IMPORTANT: Do not worry that in the available file we already have the codes to adjust the database and carry out this treatment
I will provide a solved exercise to practice
For this project, imagine that your boss makes a database available so that you can analyze it based on your knowledge of Python. For this task, you must use Matplotlib to visualize some graphs, and the Pandas and NumPy libraries for data analysis and manipulation. With the correct use of these language features, you will be able to conduct data analysis and visualization properly, the basic work of a Data Scientist. To start your project, follow the instructions below:
- Download the files:
'1-dadosgovbr---2014.csv' , 'Project.ipynb' and store them in the same folder where you will store your code files.
- Load the .csv table so you can read data from it
- Print part of the content to check if the reading is happening correctly
- I made the resolved project available in the file: 'projectanswered.ipynb'
Note: In this first step, indicated by the instructions above, I already helped you, indicating the way to load the table, according to the code below :D #ThanksGod
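Since '1-dadosgovbr---2014.csv' isn't bundled here, the sketch below writes a tiny stand-in CSV and loads it with the same pattern; with the real file you would just pass its name to pd.read_csv:

```python
import pandas as pd

# Stand-in for the course's CSV file (replace "demo.csv" with the real name)
with open("demo.csv", "w") as f:
    f.write("col_a,col_b\n1,2\n3,4\n")

# Load the .csv table so you can read data from it
df = pd.read_csv("demo.csv")

# Print part of the content to check if the reading worked
print(df.head())
```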