1295
Big data analysis techniques

Wednesday, July 20, 2016: 11:35 AM
155 C (Salt Palace Convention Center)
Normand St-Pierre, Ohio State University, Columbus, OH
Abstract Text:

The term ‘big data’ has recently entered our lexicon. Data scientists and statisticians have loosely defined big data as datasets with billions (10^9) of rows (tuples) of data. Hence, very few datasets in the animal sciences would qualify as true big data. At best, we deal with large datasets in the millions of tuples. Regardless, large datasets share some of the issues surrounding big data analyses: (1) the near certainty that outliers are present, and (2) a low signal-to-noise ratio (irrelevant variables, subtle relationships, data imbalance, near collinearity). In large datasets, outliers are more than unidimensional: higher dimensions must be scrutinized. An example of this involved the characterization of feed composition data. Techniques used to address the low signal-to-noise issue can be classified into 2 groups: transparent techniques and black box techniques. The most prevalent techniques in the first group are visualization through smoothing, regression, principal component analysis (PCA), decision trees, clustering methods, and multivariate adaptive regression splines (MARS). Black box techniques include neural networks, k-nearest neighbors (KNN), k-means, support vector machines, and genetic algorithms. Each technique will be briefly explained using an example. With PCA, we first find the direction along which the data have maximum variance. A second direction is then found, which has maximum variance among all directions perpendicular to the first. The process is repeated until there are as many directions (vectors) as original variables. Advantages of PCA are dimension reduction and the ability to handle more predictors than observations. Disadvantages are that the resulting components often lack a clear interpretation and that the method is linear. Issues that arise when only summary statistics are available (i.e., meta-analysis) will be explained, including the importance of properly weighting observations and accounting for the inherent blocking in the meta-design.
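
The PCA procedure described above (successive perpendicular directions of maximum variance) can be illustrated with a minimal sketch. It is not taken from the presentation: the function name `pca_directions` and the simulated data are assumptions made for illustration, and the directions are obtained from a singular value decomposition of the centered data, which gives the same result as the sequential search.

```python
import numpy as np


def pca_directions(X):
    """Return principal directions (one per row) and the variance along each.

    Hypothetical helper for illustration only.
    """
    Xc = X - X.mean(axis=0)  # center each variable
    # Rows of Vt are orthonormal directions, ordered by decreasing
    # variance of the data projected onto them.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)
    return Vt, variances


# Simulated data (an assumption): 200 observations of 3 variables,
# two of which are driven by a shared latent factor.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([
    latent + 0.1 * rng.normal(size=(200, 1)),
    2.0 * latent + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

directions, variances = pca_directions(X)

# Dimension reduction: keep only the scores on the first two directions.
scores = (X - X.mean(axis=0)) @ directions[:2].T
```

In this sketch there are as many directions as variables because there are more observations than variables. With more predictors than observations, the centered data have rank at most n - 1, so at most n - 1 directions carry nonzero variance.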

Keywords: big data, principal component analysis, meta-analysis