When doing statistical analysis and data science, looking at the raw data is often not helpful. Summary statistics are often needed to supply useful information about the data. One form of this is running descriptive statistics on your numeric columns, which tells you a lot about the data. While these are values you can calculate on your own, describe is a quick way to get them all at once.

When using describe in R, the function has the form describe(dataset), where "dataset" is the data set being described. The function accepts any data type, including missing data. It produces a summary table of information about the data set. The exact content of the table depends upon the data structure being analyzed.

The describe function supplies a lot of statistical information, and what it supplies depends upon what it is examining. The mean is the measure of central tendency reported for all numeric data, and the median is reported for vectors along with the other quantiles. It does not give you the standard deviation or sample variance, but it does supply the range of values. The function also gives you the Gmd value (Gini's mean difference), a measure of dispersion that stands in for the deviation and variance figures. It does not distinguish a response variable; a response is summarized like any other variable.

Here are two R code examples showing the describe function in action. The first applies the describe function to a simple vector. The second applies the describe function to a data frame, so it produces more results.

The main application of the describe function is supplying statistical information about the contents of a vector. When it is applied to a data frame, it treats each column as a separate vector.

#now show how to adjust the displayed number of digits
des <- describe(sat.act) #find the descriptive statistics
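The remark that these values can be calculated on your own can be made concrete in base R. The snippet below is only a sketch, not the describe implementation: the vector x and the helper hand_summary are hypothetical, and Gmd is assumed to mean Gini's mean difference, computed here as the mean absolute difference over all pairs of observed values.

```r
# Hypothetical example vector with one missing value
x <- c(2, 4, 4, 5, 7, 9, NA)

# hand_summary() is an illustrative helper, not part of any package
hand_summary <- function(v) {
  obs <- v[!is.na(v)]                 # drop missing values before computing
  list(
    n         = length(obs),          # number of observed values
    missing   = sum(is.na(v)),        # number of missing values
    mean      = mean(obs),
    gmd       = mean(dist(obs)),      # mean |x_i - x_j| over all pairs i < j
    quantiles = quantile(obs, c(.05, .25, .50, .75, .95)),
    range     = range(obs)            # lowest and highest observed value
  )
}

res <- hand_summary(x)
print(res)
```

Doing this by hand for every column of a data frame quickly becomes tedious, which is the case for using a single describe call instead.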
To describe the data for different groups, see describeBy.

mad: median absolute deviation (from the median)

For very large data sets that are data.frames, describe can be rather slow. Converting the data to a matrix first is recommended. However, if the data are of different types (factors or logical), this is not possible. If the data set includes columns of character data, it is also not possible. Thus, a quick pass with describeData is recommended.

For the greatest speed, at the cost of losing information, do not ask for ranges or for skew and turn off check. This is done automatically if the fast option is TRUE or for large data sets. If the number of cases times the number of variables exceeds 10^7 (ncol * nrow > 10^7), fast will be set to TRUE. This will provide just n, mean, sd, min, max, range, and standard errors. To get all of the statistics (but at a cost of greater time) set fast=FALSE.

The problem seems to be a memory limitation in that the time taken is an accelerating function of nvars * nobs. Thus, for a largish problem (72,000 cases with 1680 variables) which might take 330 seconds, doing it as two sets of 840 variables cuts the time down to 80 seconds.

The object returned is a data frame with the normal precision of R. However, to control the number of digits displayed, you can set digits in a print command, rather than losing precision at the descriptive stats level. One approach just sets the number of digits; the other uses signif to make 'prettier' output where all numbers are displayed to the same number of significant digits.

References
Joanes, D. N. and Gill, C. A. (1998). Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183-189.

See Also
describeBy, skew, kurtosi, interp.median, read.clipboard. Then, for graphic output, see error.crosses, pairs.panels, error.bars, and densityBy, or violinBy.

Examples
describe(sat.act,IQR=TRUE) #show the interquartile Range
describe(sat.act,quant=c(.1,.25,.5,.75,.90) ) #find the 10th, 25th, 50th, 75th and 90th percentiles
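The difference between setting a number of digits and using signif at print time can be illustrated without the package. This is a base R sketch under stated assumptions: stats is a made-up table standing in for the data frame that describe returns, the variable names are hypothetical, and round() and signif() stand in for the two print options. It does not call the package's own print method.

```r
# Made-up statistics table standing in for the data frame describe returns
stats <- data.frame(
  mean = c(612.2342857, 28.5471429),
  sd   = c(112.9025659, 4.8235600),
  row.names = c("verbal", "composite")   # hypothetical variable names
)

fixed_digits <- round(stats, 2)   # same number of decimal places everywhere
sig_digits   <- signif(stats, 3)  # same number of significant digits everywhere

print(fixed_digits)
print(sig_digits)

# The stored object is untouched, so no precision is lost by formatting late
print(stats$mean[1], digits = 10)
```

Formatting only at display time keeps the full-precision values available for any later computation on the returned object.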
R: Basic descriptive statistics useful for psychometrics

describe

… m_4/m_2^2).

The fast=TRUE option will lead to a speed up of about 50% for larger problems by not finding all of the statistics (see NOTE).

The additional helper function describeData just scans the data array and reports on whether the data are all numerical, logical/factorial, or categorical. This is a useful check to run when trying to get descriptive statistics on very large data sets where, to improve the speed, the check option is FALSE.
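The kind of scan attributed to describeData above can be approximated in base R. The following is a rough sketch, not the package's code: column_kinds is a hypothetical helper, the data frame is made up, and treating every non-numeric, non-logical, non-factor column as "categorical" is an assumption made for the illustration.

```r
# column_kinds() is a hypothetical helper written for illustration;
# it is not the describeData function from the package.
column_kinds <- function(df) {
  vapply(df, function(col) {
    if (is.numeric(col)) {
      "numeric"
    } else if (is.logical(col) || is.factor(col)) {
      "logical/factor"
    } else {
      "categorical"
    }
  }, character(1))
}

# Made-up data frame with mixed column types
df <- data.frame(
  score  = c(10.5, 12.0, 9.5),
  passed = c(TRUE, TRUE, FALSE),
  group  = factor(c("a", "b", "a")),
  name   = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

kinds <- column_kinds(df)
print(kinds)

# Only an all-numeric data frame converts cleanly to a numeric matrix
if (all(kinds == "numeric")) {
  m <- as.matrix(df)
} else {
  m <- as.matrix(df[kinds == "numeric"])  # keep just the numeric columns
}
```

A check like this is cheap compared with computing the statistics, so it makes sense to run it once before turning off type checking on a large data set.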