You can see that the first column lists the row number, which is handy for referencing a specific observation.
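As a minimal sketch, using a tiny hand-built frame as a stand-in for the dataset (the column names follow the post; the values here are illustrative only, and in practice you would load the full CSV):

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# head(n) returns the first n rows; printing it shows the row
# index in the first column, handy for referencing an observation.
print(data.head(3))
```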
2. Dimensions of Your Data
You must have a very good handle on how much data you have, both in terms of rows and columns.
You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.
The results are listed in rows then columns. You can see that the dataset has 768 rows and 9 columns.
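A short sketch of the same idea, again on a small synthetic frame standing in for the full dataset (which would print (768, 9)):

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# shape is a (rows, columns) tuple.
print(data.shape)
```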
3. Data Type For Each Attribute
The type of each attribute is important.
Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.
You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.
You can see that most of the attributes are integers and that mass and pedi are floating point values.
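Sketched on the same kind of small stand-in frame, dtypes reports one type per column:

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# dtypes lists the type pandas inferred for each attribute:
# integer columns for counts and the class, floating point
# for mass and pedi.
print(data.dtypes)
```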
4. Descriptive Statistics
Descriptive statistics can give you great insight into the shape of each attribute.
Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:

- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th Percentile
- 50th Percentile (Median)
- 75th Percentile
- Maximum Value
You can see that you do get a lot of data. You will note some calls to pandas.set_option() in the recipe to change the precision of the numbers and the preferred width of the output. This is to make it more readable for this example.
When describing your data this way, it is worth taking some time and reviewing observations from the results. This might include the presence of “NA” values for missing data or surprising distributions for attributes.
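A sketch of the recipe, including the set_option() calls mentioned above, on a small stand-in frame:

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# Widen the output and trim the precision so the table is readable.
pd.set_option('display.width', 100)
pd.set_option('display.precision', 3)

# describe() returns the 8 statistics as rows, one column per attribute.
description = data.describe()
print(description)
```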
5. Class Distribution (Classification Only)
On classification problems you need to know how balanced the class values are.
Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.
You can quickly get an idea of the distribution of the class attribute in Pandas.
You can see that there are nearly twice as many observations with class 0 (no onset of diabetes) as with class 1 (onset of diabetes).
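One way to get those counts, sketched on a small stand-in frame (the real dataset would show the roughly 2:1 imbalance described above):

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# Group by the class attribute and count the rows in each group.
class_counts = data.groupby('class').size()
print(class_counts)
```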
6. Correlation Between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together.
The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all.
Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pair-wise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.
The matrix lists all attributes across the top and down the side, giving the correlation between all pairs of attributes (twice, because the matrix is symmetrical). You can see that the diagonal from the top-left to the bottom-right corner shows the perfect correlation of each attribute with itself.
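A sketch of the correlation matrix on a small stand-in frame:

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# Pairwise Pearson correlations; the matrix is symmetric and its
# diagonal is exactly 1.0 (each attribute with itself).
correlations = data.corr(method='pearson')
print(correlations)
```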
7. Skew of Univariate Distributions
Skew refers to a distribution that is assumed to be Gaussian (normal or bell curve) but is shifted or squashed in one direction or another.
Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.
The skew result shows a positive (right) or negative (left) skew. Values closer to zero show less skew.
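A sketch of the skew calculation on a small stand-in frame (with so few rows the values are not meaningful; on the full dataset each attribute gets a useful skew estimate):

```python
import pandas as pd

# Tiny synthetic stand-in for the dataset; values are illustrative only.
data = pd.DataFrame({
    'preg':  [6, 1, 8],
    'mass':  [33.6, 26.6, 23.3],
    'pedi':  [0.627, 0.351, 0.672],
    'class': [1, 0, 1],
})

# skew() returns one value per attribute: positive means a right
# (longer right tail) skew, negative a left skew.
skews = data.skew()
print(skews)
```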
In this post you discovered the importance of describing your dataset before you start work on your machine learning project.
You discovered 7 different ways to summarize your dataset using Python and Pandas:

1. Peek at your raw data.
2. Dimensions of your data.
3. Data type for each attribute.
4. Descriptive statistics.
5. Class distribution (classification only).
6. Correlation between attributes.
7. Skew of univariate distributions.
Source: ML Mastery