skill_properties.json

[{"job": "data scientist", "skill": "machine learning", "keywords": "machine learning", "description": "Machine learning (ML) is the scientific study of algorithms and statistical models that computer systems use to perform a specific task without using explicit instructions, relying on patterns and inference instead. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as \"training data\", in order to make predictions or decisions without being explicitly programmed to perform the task. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop a conventional algorithm for effectively performing the task.\nMachine learning is closely related to computational statistics, which focuses on making predictions using computers. The study of mathematical optimization delivers methods, theory and application domains to the field of machine learning. Data mining is a field of study within machine learning, and focuses on exploratory data analysis through unsupervised learning. In its application across business problems, machine learning is also referred to as predictive analytics.\n\n\n== Overview ==\nThe name machine learning was coined in 1959 by Arthur Samuel. Tom M. Mitchell provided a widely quoted, more formal definition of the algorithms studied in the machine learning field: \"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P,  improves with experience E.\" This definition of the tasks in which machine learning is concerned offers a fundamentally operational definition rather than defining the field in cognitive terms. This follows Alan Turing's proposal in his paper \"Computing Machinery and Intelligence\", in which the question \"Can machines think?\" is replaced with the question \"Can machines do what we (as thinking entities) can do?\". In Turing's proposal the various characteristics that could be possessed by a thinking machine and the various implications in constructing one are exposed.\n\n\n=== Machine learning tasks ===\n\nMachine learning tasks are classified into several broad categories. In supervised learning, the algorithm builds a mathematical model from a set of data that contains both the inputs and the desired outputs. For example, if the task were determining whether an image contained a certain object, the training data for a supervised learning algorithm would include images with and without that object (the input), and each image would have a label (the output) designating whether it contained the object. In special cases, the input may be only partially available, or restricted to special feedback. Semi-supervised learning algorithms develop mathematical models from incomplete training data, where a portion of the sample input doesn't have labels.\nClassification algorithms and regression algorithms are types of supervised learning. Classification algorithms are used when the outputs are restricted to a limited set of values. For a classification algorithm that filters emails, the input would be an incoming email, and the output would be the name of the folder in which to file the email. For an algorithm that identifies spam emails, the output would be the prediction of either \"spam\" or \"not spam\", represented by the Boolean values true and false. Regression algorithms are named for their continuous outputs, meaning they may have any value within a range. Examples of a continuous value are the temperature, length, or price of an object.\nIn unsupervised learning, the algorithm builds a mathematical model from a set of data which contains only inputs and no desired output labels. Unsupervised learning algorithms are used to find structure in the data, like grouping or clustering of data points. Unsupervised learning can discover patterns in the data, and can group the inputs into categories, as in feature learning. Dimensionality reduction is the process of reducing the number of \"features\", or inputs, in a set of data.\nActive learning algorithms access the desired outputs (training labels) for a limited set of inputs based on a budget, and optimize the choice of inputs for which it will acquire training labels. When used interactively, these can be presented to a human user for labeling. Reinforcement learning algorithms are given feedback in the form of positive or negative reinforcement in a dynamic environment, and are used in autonomous vehicles or in learning to play a game against a human opponent. Other specialized algorithms in machine learning include topic modeling, where the computer program is given a set of natural language documents and finds other documents that cover similar topics. Machine learning algorithms can be used to find the unobservable probability density function in density estimation problems. Meta learning algorithms learn their own inductive bias based on previous experience. In developmental robotics, robot learning algorithms generate their own sequences of learning experiences, also known as a curriculum, to cumulatively acquire new skills through self-guided exploration and social interaction with humans. These robots use guidance mechanisms such as active learning, maturation, motor synergies, and imitation.\n\n\n== History and relationships to other fields ==\n\nArthur Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term \"Machine Learning\" in 1959 while at IBM. A representative book of the machine learning research during 1960s was the Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification. The interest of machine learning related to pattern recognition continued during 1970s, as described in the book of Duda and Hart in 1973.  In 1981 a report was given on using teaching strategies so that a neural network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal.  \nAs a scientific endeavor, machine learning grew out of the quest for artificial intelligence. Already in the early days of AI as an academic discipline, some researchers were interested in having machines learn from data. They attempted to approach the problem with various symbolic methods, as well as what were then termed \"neural networks\"; these were mostly perceptrons and other models that were later found to be reinventions of the generalized linear models of statistics. Probabilistic reasoning was also employed, especially in automated medical diagnosis.However, an increasing emphasis on the logical, knowledge-based approach caused a rift between AI and machine learning. Probabilistic systems were plagued by theoretical and practical problems of data acquisition and representation. By 1980, expert systems had come to dominate AI, and statistics was out of favor. Work on symbolic/knowledge-based learning did continue within AI, leading to inductive logic programming, but the more statistical line of research was now outside the field of AI proper, in pattern recognition and information retrieval. Neural networks research had been abandoned by AI and computer science around the same time. This line, too, was continued outside the AI/CS field, as \"connectionism\", by researchers from other disciplines including Hopfield, Rumelhart and Hinton. Their main success came in the mid-1980s with the reinvention of backpropagation.Machine learning, reorganized as a separate field, started to flourish in the 1990s. The field changed its goal from achieving artificial intelligence to tackling solvable problems of a practical nature. It shifted focus away from the symbolic approaches it had inherited from AI, and toward methods and models borrowed from statistics and probability theory. It also benefited from the increasing availability of digitized information, and the ability to distribute it via the Internet.\n\n\n=== Relation to data mining ===\nMachine learning and data mining often employ the same methods and overlap significantly, but while machine learning focuses on prediction, based on known properties learned from the training data, data mining focuses on the discovery of (previously) unknown properties in the data (this is the analysis step of knowledge discovery in databases). Data mining uses many machine learning methods, but with different goals; on the other hand, machine learning also employs data mining methods as \"unsupervised learning\" or as a preprocessing step to improve learner accuracy. Much of the confusion between these two research communities (which do often have separate conferences and separate journals, ECML PKDD being a major exception) comes from the basic assumptions they work with: in machine learning, performance is usually evaluated with respect to the ability to reproduce known knowledge, while in knowledge discovery and data mining (KDD) the key task is the discovery of previously unknown knowledge. Evaluated with respect to known knowledge, an uninformed (unsupervised) method will easily be outperformed by other supervised methods, while in a typical KDD task, supervised methods cannot be used due to the unavailability of training data.\n\n\n=== Relation to optimization ===\nMachine learning also has intimate ties to optimization: many learning problems are formulated as minimization of some loss function on a training set of examples. Loss functions express the discrepancy between the predictions of the model being trained and the actual problem instances (for example, in classification, one wants to assign a label to instances, and models are trained to correctly predict the pre-assigned labels of a set of examples). The difference between the two fields arises from the goal of generalization: while optimization algorithms can minimize the loss on a training set, machine learning is concerned with minimizing the loss on unseen samples.\n\n\n=== Relation to statistics ===\nMachine learning and statistics are closely related fields in terms of methods, but distinct in their principal goal: statistics draws population inferences from a sample, while machine learning finds generalizable predictive patterns. According to Michael I. Jordan, the ideas of machine learning, from methodological principles to theoretical tools, have had a long pre-history in statistics. He also suggested the term data science as a placeholder to call the overall field.Leo Breiman distinguished two statistical modelling paradigms: data model and algorithmic model, wherein \"algorithmic model\" means more or less the machine learning algorithms like Random forest.\nSome statisticians have adopted methods from machine learning, leading to a combined field that they call statistical learning.\n\n\n== Theory ==\n\nA core objective of a learner is to generalize from its experience. Generalization in this context is the ability of a learning machine to perform accurately on new, unseen examples/tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences) and the learner has to build a general model about this space that enables it to produce sufficiently accurate predictions in new cases.\nThe computational analysis of machine learning algorithms and their performance is a branch of theoretical computer science known as computational learning theory. Because training sets are finite and the future is uncertain, learning theory usually does not yield guarantees of the performance of algorithms. Instead, probabilistic bounds on the performance are quite common. The bias\u2013variance decomposition is one way to quantify generalization error.\nFor the best performance in the context of generalization, the complexity of the hypothesis should match the complexity of the function underlying the data. If the hypothesis is less complex than the function, then the model has underfit the data. If the complexity of the model is increased in response, then the training error decreases. But if the hypothesis is too complex, then the model is subject to overfitting and generalization will be poorer.In addition to performance bounds, learning theorists study the time complexity and feasibility of learning. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. There are two kinds of time complexity results. Positive results show that a certain class of functions can be learned in polynomial time. Negative results show that certain classes cannot be learned in polynomial time.\n\n\n== Approaches ==\n\n\n=== Types of learning algorithms ===\nThe types of machine learning algorithms differ in their approach, the type of data they input and output, and the type of task or problem that they are intended to solve.\n\n\n==== Supervised learning ====\n\nSupervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs. The data is known as training data, and consists of a set of training examples. Each training example has one or more inputs and a desired output, also known as a supervisory signal.  In the mathematical model, each training example is represented by an array or vector, sometimes called a feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output associated with new inputs. An optimal function will allow the algorithm to correctly determine the output for inputs that were not a part of the training data. An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task.Supervised learning algorithms include classification and regression. Classification algorithms are used when the outputs are restricted to a limited set of values, and regression algorithms are used when the outputs may have any numerical value within a range. Similarity learning is an area of supervised machine learning closely related to regression and classification, but the goal is to learn from examples using a similarity function that measures how similar or related two objects are. It has applications in ranking, recommendation systems, visual identity tracking, face verification, and speaker verification.\nIn the case of semi-supervised learning algorithms, some of the training examples are missing training labels, but they can nevertheless be used to improve the quality of a model. In weakly supervised learning, the training labels are noisy, limited, or imprecise; however, these labels are often cheaper to obtain, resulting in larger effective training sets.\n\n\n==== Unsupervised learning ====\n\nUnsupervised learning algorithms take a set of data that contains only inputs, and find structure in the data, like grouping or clustering of data points. The algorithms therefore learn from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data. A central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses other domains involving summarizing and explaining data features.\nCluster analysis is the assignment of a set of observations into subsets (called clusters) so that observations within the same cluster are similar according to one or more predesignated criteria, while observations drawn from different clusters are dissimilar. Different clustering techniques make different assumptions on the structure of the data, often defined by some similarity metric and evaluated, for example, by internal compactness, or the similarity between members of the same cluster, and separation, the difference between clusters. Other methods are based on estimated density and graph connectivity.\nSemi-supervised Learning\n\nSemi-supervised learning falls between unsupervised learning (without any labeled training data) and supervised learning (with completely labeled training data).  Many machine-learning researchers have found that unlabeled data, when used in conjunction with a small amount of labeled data, can produce considerable improvement in learning accuracy.\n\n\n==== Reinforcement learning ====\n\nReinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Due to its generality, the field is studied in many other disciplines, such as game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics and genetic algorithms. In machine learning, the environment is typically represented as a Markov Decision Process (MDP). Many reinforcement learning algorithms use dynamic programming techniques. Reinforcement learning algorithms do not assume knowledge of an exact mathematical model of the MDP, and are used when exact models are infeasible. Reinforcement learning algorithms are used in autonomous vehicles or in learning to play a game against a human opponent.\n\n\n==== Self learning ====\nSelf learning as machine learning paradigm was introduced in 1982 along with a neural network capable of self-learning  named Crossbar Adaptive Array (CAA).  It is a learning with no external rewards and no external teacher advices. The CAA self learning algorithm computes, in a crossbar fashion, both decisions about actions and emotions (feelings) about consequence situations. The system is driven by the interaction between cognition and emotion. \nThe self learning algorithm updates a memory matrix W =||w(a,s)|| such that in each iteration executes the following machine learning  routine: \n\n In situation s perform action a;\n Receive consequence situation s\u2019;\n Compute emotion of being in consequence situation v(s\u2019);\n Update crossbar memory  w\u2019(a,s) = w(a,s) + v(s\u2019).\n\nIt is a system with only one input, situation s, and only one output, action (or behavior) a. There is neither a separate reinforcement input nor an advice input from the environment. The backpropagated value (secondary reinforcement) is the emotion toward the consequence situation. The CAA exists in two environments, one is behavioral environment where it behaves, and the other is genetic environment, where from it initially and only once receives initial emotions about situations to be encountered in the  behavioral environment. After receiving the genome (species) vector from the genetic environment, the CAA learns a goal seeking  behavior, in an environment which contains both desirable and undesirable situations. \n\n\n==== Feature learning ====\n\nSeveral learning algorithms aim at discovering better representations of the inputs provided during training. Classic examples include principal components analysis and cluster analysis. Feature learning algorithms, also called representation learning algorithms, often attempt to preserve the information in their input but also transform it in a way that makes it useful, often as a pre-processing step before performing classification or predictions. This technique allows reconstruction of the inputs coming from the unknown data-generating distribution, while not being necessarily faithful to configurations that are implausible under that distribution. This replaces manual feature engineering, and allows a machine to both learn the features and use them to perform a specific task.\nFeature learning can be either supervised or unsupervised. In supervised feature learning, features are learned using labeled input data. Examples include artificial neural networks, multilayer perceptrons, and supervised dictionary learning. In unsupervised feature learning, features are learned with unlabeled input data.  Examples include dictionary learning, independent component analysis, autoencoders, matrix factorization and various forms of clustering.Manifold learning algorithms attempt to do so under the constraint that the learned representation is low-dimensional. Sparse coding algorithms attempt to do so under the constraint that the learned representation is sparse, meaning that the mathematical model has many zeros. Multilinear subspace learning algorithms aim to learn low-dimensional representations directly from tensor representations for multidimensional data, without reshaping them into higher-dimensional vectors. Deep learning algorithms discover multiple levels of representation, or a hierarchy of features, with higher-level, more abstract features defined in terms of (or generating) lower-level features. It has been argued that an intelligent machine is one that learns a representation that disentangles the underlying factors of variation that explain the observed data.Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensory data has not yielded to attempts to algorithmically define specific features. An alternative is to discover such features or representations through examination, without relying on explicit algorithms.\n\n\n==== Sparse dictionary learning ====\n\nSparse dictionary learning is a feature learning method where a training example is represented as a linear combination of basis functions, and is assumed to be a sparse matrix. The method is strongly NP-hard and difficult to solve approximately. A popular heuristic method for sparse dictionary learning is the K-SVD algorithm. Sparse dictionary learning has been applied in several contexts. In classification, the problem is to determine the class to which a previously unseen training example belongs. For a dictionary where each class has already been built, a new training example is associated with the class that is best sparsely represented by the corresponding dictionary. Sparse dictionary learning has also been applied in image de-noising. The key idea is that a clean image patch can be sparsely represented by an image dictionary, but the noise cannot.\n\n\n==== Anomaly detection ====\n\nIn data mining, anomaly detection, also known as outlier detection, is the identification of rare items, events or observations which raise suspicions by differing significantly from the majority of the data. Typically, the anomalous items represent an issue such as bank fraud, a structural defect, medical problems or errors in a text. Anomalies are referred to as outliers, novelties, noise, deviations and exceptions.In particular, in the context of abuse and network intrusion detection, the interesting objects are often not rare objects, but unexpected bursts in activity. This pattern does not adhere to the common statistical definition of an outlier as a rare object, and many outlier detection methods (in particular, unsupervised algorithms) will fail on such data, unless it has been aggregated appropriately. Instead, a cluster analysis algorithm may be able to detect the micro-clusters formed by these patterns.Three broad categories of anomaly detection techniques exist. Unsupervised anomaly detection techniques detect anomalies in an unlabeled test data set under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least to the remainder of the data set. Supervised anomaly detection techniques require a data set that has been labeled as \"normal\" and \"abnormal\" and involves training a classifier (the key difference to many other statistical classification problems is the inherent unbalanced nature of outlier detection). Semi-supervised anomaly detection techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance to be generated by the model.\n\n\n==== Association rules ====\n\nAssociation rule learning is a rule-based machine learning method for discovering relationships between variables in large databases. It is intended to identify strong rules discovered in databases using some measure of \"interestingness\".Rule-based machine learning is a general term for any machine learning method that identifies, learns, or evolves \"rules\" to store, manipulate or apply knowledge. The defining characteristic of a rule-based machine learning algorithm is the identification and utilization of a set of relational rules that collectively represent the knowledge captured by the system. This is in contrast to other machine learning algorithms that commonly identify a singular model that can be universally applied to any instance in order to make a prediction. Rule-based machine learning approaches include learning classifier systems, association rule learning, and artificial immune systems.\nBased on the concept of strong rules, Rakesh Agrawal, Tomasz Imieli\u0144ski and Arun Swami introduced association rules for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule \n  \n    \n      \n        {\n        \n          o\n          n\n          i\n          o\n          n\n          s\n          ,\n          p\n          o\n          t\n          a\n          t\n          o\n          e\n          s\n        \n        }\n        \u21d2\n        {\n        \n          b\n          u\n          r\n          g\n          e\n          r\n        \n        }\n      \n    \n    {\\displaystyle \\{\\mathrm {onions,potatoes} \\}\\Rightarrow \\{\\mathrm {burger} \\}}\n   found in the sales data of a supermarket would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placements. In addition to market basket analysis, association rules are employed today in application areas including Web usage mining, intrusion detection, continuous production, and bioinformatics. In contrast with sequence mining, association rule learning typically does not consider the order of items either within a transaction or across transactions.\nLearning classifier systems (LCS) are a family of rule-based machine learning algorithms that combine a discovery component, typically a genetic algorithm, with a learning component, performing either supervised learning, reinforcement learning, or unsupervised learning. They seek to identify a set of context-dependent rules that collectively store and apply knowledge in a piecewise manner in order to make predictions.Inductive logic programming (ILP) is an approach to rule-learning using logic programming as a uniform representation for input examples, background knowledge, and hypotheses. Given an encoding of the known background knowledge and a set of examples represented as a logical database of facts, an ILP system will derive a hypothesized logic program that entails all positive and no negative examples. Inductive programming is a related field that considers any kind of programming languages for representing hypotheses (and not only logic programming), such as functional programs.\nInductive logic programming is particularly useful in bioinformatics and natural language processing. Gordon Plotkin and Ehud Shapiro laid the initial theoretical foundation for inductive machine learning in a logical setting. Shapiro built their first implementation (Model Inference System) in 1981: a Prolog program that inductively inferred logic programs from positive and negative examples. The term inductive here refers to philosophical induction, suggesting a theory to explain observed facts, rather than mathematical induction, proving a property for all members of a well-ordered set.\n\n\n=== Models ===\nPerforming machine learning involves creating a model, which is trained on some training data and then can process additional data to make predictions. Various types of models have been used and researched for machine learning systems.\n\n\n==== Artificial neural networks ====\n\nArtificial neural networks (ANNs), or connectionist systems, are computing systems vaguely inspired by the biological neural networks that constitute animal brains. Such systems \"learn\" to perform tasks by considering examples, generally without being programmed with any task-specific rules.\nAn ANN is a model based on a collection of connected units or nodes called \"artificial neurons\", which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit information, a \"signal\", from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it. In common ANN implementations, the signal at a connection between artificial neurons is a real number, and the output of each artificial neuron is computed by some non-linear function of the sum of its inputs. The connections between artificial neurons are called \"edges\". Artificial neurons and edges typically have a weight that adjusts as learning proceeds. The weight increases or decreases the strength of the signal at a connection. Artificial neurons may have a threshold such that the signal is only sent if the aggregate signal crosses that threshold. Typically, artificial neurons are aggregated into layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first layer (the input layer), to the last layer (the output layer), possibly after traversing the layers multiple times.\nThe original goal of the ANN approach was to solve problems in the same way that a human brain would. However, over time, attention moved to performing specific tasks, leading to deviations from biology. Artificial neural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.\nDeep learning consists of multiple hidden layers in an artificial neural network. This approach tries to model the way the human brain processes light and sound into vision and hearing. Some successful applications of deep learning are computer vision and speech recognition.\n\n\n==== Decision trees ====\n\nDecision tree learning uses a decision tree as a predictive model to go from observations about an item (represented in the branches) to conclusions about the item's target value (represented in the leaves). It is one of the predictive modeling approaches used in statistics, data mining and machine learning. Tree models where the target variable can take a discrete set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. In data mining, a decision tree describes data, but the resulting classification tree can be an input for decision making.\n\n\n==== Support vector machines ====\n\nSupport vector machines (SVMs), also known as support vector networks, are a set of related supervised learning methods used for classification and regression. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that predicts whether a new example falls into one category or the other.  An SVM training algorithm is a non-probabilistic, binary, linear classifier, although methods such as Platt scaling exist to use SVM in a probabilistic classification setting. In addition to performing linear classification, SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces.\n\n\n==== Regression analysis ====\n\nRegression analysis encompasses a large variety of statistical methods to estimate the relationship between input variables and their associated features. Its most common form is linear regression, where a single line is drawn to best fit the given data according to a mathematical criterion such as ordinary least squares. The latter is oftentimes extended by regularization (mathematics) methods to mitigate overfitting and high bias, as can be seen in ridge regression. When dealing with non-linear problems, go-to models include polynomial regression (e.g. used for trendline fitting in Microsoft Excel ), Logistic regression (often used in statistical classification) or even kernel regression, which introduces non-linearity by taking advantage of the kernel trick to implicitly map input variables to higher dimensional space. \n\n\n==== Bayesian networks ====\n\nA Bayesian network, belief network or directed acyclic graphical model is a probabilistic graphical model that represents a set of random variables and their conditional independence with a directed acyclic graph (DAG). For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases. Efficient algorithms exist that perform inference and learning. Bayesian networks that model sequences of variables, like speech signals or protein sequences, are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.\n\n\n==== Genetic algorithms ====\n\nA genetic algorithm (GA) is a search algorithm and heuristic technique that mimics the process of natural selection, using methods such as mutation and crossover to generate new genotypes in the hope of finding good solutions to a given problem. In machine learning, genetic algorithms were used in the 1980s and 1990s. Conversely, machine learning techniques have been used to improve the performance of genetic and evolutionary algorithms.\n\n\n=== Training models ===\nUsually, machine learning models require a lot of data in order for them to perform well. Usually, when training a machine learning model, one needs to collect a large, representative sample of data from a training set. Data from the training set can be as varied as a corpus of text, a collection of images, and data collected from individual users of a service. Overfitting is something to watch out for when training a machine learning model.\n\n\n==== Federated learning ====\n\nFederated learning is a new approach to training machine learning models that decentralizes the training process, allowing for users' privacy to be maintained by not needing to send their data to a centralized server. This also increases efficiency by decentralizing the training process to many devices. For example, Gboard uses federated machine learning to train search query prediction models on users' mobile phones without having to send individual searches back to Google.\n\n\n== Applications ==\nThere are many applications for machine learning, including:\n\nIn 2006, the media-services provider Netflix held the first \"Netflix Prize\" competition to find a program to better predict user preferences and improve the accuracy on its existing Cinematch movie recommendation algorithm by at least 10%.  A joint team made up of researchers from AT&T Labs-Research in collaboration with the teams Big Chaos and Pragmatic Theory built an ensemble model to win the Grand Prize in 2009 for $1 million. Shortly after the prize was awarded, Netflix realized that viewers' ratings were not the best indicators of their viewing patterns (\"everything is a recommendation\") and they changed their recommendation engine accordingly. In 2010 The Wall Street Journal wrote about the firm Rebellion Research and their use of machine learning to predict the financial crisis. In 2012, co-founder of Sun Microsystems, Vinod Khosla, predicted that 80% of medical doctors' jobs would be lost in the next two decades to automated machine learning medical diagnostic software. In 2014, it was reported that a machine learning algorithm had been applied in the field of art history to study fine art paintings, and that it may have revealed previously unrecognized influences among artists. In 2019 Springer Nature published the first research book created using machine learning.\n\n\n== Limitations ==\nAlthough machine learning has been transformative in some fields, machine-learning programs often fail to deliver expected results. Reasons for this are numerous: lack of (suitable) data, lack of access to the data, data bias, privacy problems, badly chosen tasks and algorithms, wrong tools and people, lack of resources, and evaluation problems.In 2018, a self-driving car from Uber failed to detect a pedestrian, who was killed after a collision. Attempts to use machine learning in healthcare with the IBM Watson system failed to deliver even after years of time and billions of investment.\n\n\n=== Bias ===\n\nMachine learning approaches in particular can suffer from different data biases. A machine learning system trained on current customers only may not be able to predict the needs of new customer groups that are not represented in the training data. When trained on man-made data, machine learning is likely to pick up the same constitutional and unconscious biases already present in society. Language models learned from data have been shown to contain human-like biases. Machine learning systems used for criminal risk assessment have been found to be biased against black people. In 2015, Google photos would often tag black people as gorillas, and in 2018 this still was not well resolved, but Google reportedly was still using the workaround to remove all gorillas from the training data, and thus was not able to recognize real gorillas at all. Similar issues with recognizing non-white people have been found in many other systems. In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language. Because of such challenges, the effective use of machine learning may take longer to be adopted in other domains. Concern for reducing bias in machine learning and propelling its use for human good is increasingly expressed by artificial intelligence scientists, including Fei-Fei Li, who reminds engineers that \"There\u2019s nothing artificial about AI...It\u2019s inspired by people, it\u2019s created by people, and\u2014most importantly\u2014it impacts people. It is a powerful tool we are only just beginning to understand, and that is a profound responsibility.\u201d\n\n\n== Model assessments ==\nClassification machine learning models can be validated by accuracy estimation techniques like the Holdout method, which splits the data in a training and test set (conventionally 2/3 training set and 1/3 test set designation) and evaluates the performance of the training model on the test set. In comparison, the K-fold-cross-validation method randomly partitions the data into K subsets and then K experiments are performed each respectively considering 1 subset for evaluation and the remaining K-1 subsets for training the model. In addition to the holdout and cross-validation methods, bootstrap, which samples n instances with replacement from the dataset, can be used to assess model accuracy.In addition to overall accuracy, investigators frequently report sensitivity and specificity meaning True Positive Rate (TPR) and True Negative Rate (TNR) respectively. Similarly, investigators sometimes report the False Positive Rate (FPR) as well as the False Negative Rate (FNR). However, these rates are ratios that fail to reveal their numerators and denominators. The Total Operating Characteristic (TOC) is an effective method to express a model's diagnostic ability. TOC shows the numerators and denominators of the previously mentioned rates, thus TOC provides more information than the commonly used Receiver Operating Characteristic (ROC) and ROC's associated Area Under the Curve (AUC).\n\n\n== Ethics ==\nMachine learning poses a host of ethical questions. Systems which are trained on datasets collected with biases may exhibit these biases upon use (algorithmic bias), thus digitizing cultural prejudices. For example, using job hiring data from a firm with racist hiring policies may lead to a machine learning system duplicating the bias by scoring job applicants against similarity to previous successful applicants. Responsible collection of data and documentation of algorithmic rules used by a system thus is a critical part of machine learning.\nBecause human languages contain biases, machines trained on language corpora will necessarily also learn these biases.Other forms of ethical challenges, not related to personal biases, are more seen in health care. There are concerns among health care professionals that these systems might not be designed in the public's interest, but as income generating machines. This is especially true in the United States where there is a perpetual ethical dilemma of improving health care, but also increasing profits. For example, the algorithms could be designed to provide patients with unnecessary tests or medication in which the algorithm's proprietary owners hold stakes in. There is huge potential for machine learning in health care to provide professionals a great tool to diagnose, medicate, and even plan recovery paths for patients, but this will not happen until the personal biases mentioned previously, and these \"greed\" biases are addressed.\n\n\n== Software ==\nSoftware suites containing a variety of machine learning algorithms include the following:\n\n\n=== Free and open-source software ===\n\n\n=== Proprietary software with free and open-source editions ===\n\n\n=== Proprietary software ===\n\n\n== Journals ==\nJournal of Machine Learning Research\nMachine Learning\nNature Machine Intelligence\nNeural Computation\n\n\n== Conferences ==\nConference on Neural Information Processing Systems\nInternational Conference on Machine Learning\n\n\n== See also ==\n\n\n== References ==\n\n\n== Further reading ==\n\n\n== External links ==\nInternational Machine Learning Society\nmloss is an academic database of open-source machine learning software.\nMachine Learning Crash Course by Google. This is a free course on machine learning through the use of TensorFlow."}, {"job": "data scientist", "skill": "statistics", "keywords": "statistics", "description": "Statistics is the discipline that concerns the collection, organization, displaying, analysis, interpretation and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups of people or objects such as \"all people living in a country\" or \"every atom composing a crystal\". Statistics deals with every aspect of data, including the planning of data collection in terms of the design of surveys and experiments.\nSee glossary of probability and statistics.\nWhen census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.\nTwo main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution's central or typical value, while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences on mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena.\nA standard statistical procedure involves the test of the relationship between two statistical data sets, or a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (null hypothesis is falsely rejected giving a \"false positive\") and Type II errors (null hypothesis fails to be rejected and an actual relationship between populations is missed giving a \"false negative\"). Multiple problems have come to be associated with this framework: ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also occur. The presence of missing data or censoring may result in biased estimates and specific techniques have been developed to address these problems.\nThe earliest writings on probability and statistics, statistical methods drawing from probability theory, date back to Arab mathematicians and cryptographers, notably Al-Khalil (717\u2013786) and Al-Kindi (801\u2013873). In the 18th century, statistics also started to draw heavily from calculus. In more recent years statistics has relied more on statistical software to produce tests such as descriptive analysis.\n\n\n== Introduction ==\n\nStatistics is a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data, or as a branch of mathematics. Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is concerned with the use of data in the context of uncertainty and decision making in the face of uncertainty.In applying statistics to a problem, it is common practice to start with a population or process to be studied. Populations can be diverse topics such as \"all people living in a country\" or \"every atom composing a crystal\". Ideally, statisticians compile data about the entire population (an operation called census). This may be organized by governmental statistical institutes. Descriptive statistics can be used to summarize the population data. Numerical descriptors include mean and standard deviation for continuous data types (like income), while frequency and percentage are more useful in terms of describing categorical data (like education).\nWhen a census is not feasible, a chosen subset of the population called a sample is studied. Once a sample that is representative of the population is determined, data is collected for the sample members in an observational or experimental setting. Again, descriptive statistics can be used to summarize the sample data. However, the drawing of the sample has been subject to an element of randomness, hence the established numerical descriptors from the sample are also due to uncertainty. To still draw meaningful conclusions about the entire population, inferential statistics is needed. It uses patterns in the sample data to draw inferences about the population represented, accounting for randomness. These inferences may take the form of: answering yes/no questions about the data (hypothesis testing), estimating numerical characteristics of the data (estimation), describing associations within the data (correlation) and modeling relationships within the data (for example, using regression analysis).  Inference can extend to forecasting, prediction and estimation of unobserved values either in or associated with the population being studied; it can include extrapolation and interpolation of time series or spatial data, and can also include data mining.\n\n\n=== Mathematical statistics ===\n\nMathematical statistics is the application of mathematics to statistics. Mathematical techniques used for this include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure-theoretic probability theory.\n\n\n== History ==\n\nThe earliest writings on probability and statistics date back to Arab mathematicians and cryptographers, during the Islamic Golden Age between the 8th and 13th centuries. Al-Khalil (717\u2013786) wrote the Book of Cryptographic Messages, which contains the first use of permutations and combinations, to list all possible Arabic words with and without vowels. The earliest book on statistics is the 9th-century treatise Manuscript on Deciphering Cryptographic Messages, written by Arab scholar Al-Kindi (801\u2013873). In his book, Al-Kindi gave a detailed description of how to use statistics and frequency analysis to decipher encrypted messages. This text laid the foundations for statistics and cryptanalysis. Al-Kindi also made the earliest known use of statistical inference, while he and later Arab cryptographers developed the early statistical methods for decoding encrypted messages. Ibn Adlan (1187\u20131268) later made an important contribution, on the use of sample size in frequency analysis.The earliest European writing on statistics dates back to 1663, with the publication of Natural and Political Observations upon the Bills of Mortality by John Graunt. Early applications of statistical thinking revolved around the needs of states to base policy on demographic and economic data, hence its stat- etymology. The scope of the discipline of statistics broadened in the early 19th century to include the collection and analysis of data in general. Today, statistics is widely employed in government, business, and natural and social sciences.\nThe mathematical foundations of modern statistics were laid in the 17th century with the development of the probability theory by Gerolamo Cardano, Blaise Pascal and Pierre de Fermat. Mathematical probability theory arose from the study of games of chance, although the concept of probability was already examined in medieval law and by philosophers such as Juan Caramuel. The method of least squares was first described by  Adrien-Marie Legendre in 1805.\n\nThe modern field of statistics emerged in the late 19th and early 20th century in three stages. The first wave, at the turn of the century, was led by the work of Francis Galton and Karl Pearson, who transformed statistics into a rigorous mathematical discipline used for analysis, not just in science, but in industry and politics as well. Galton's contributions included introducing the concepts of standard deviation, correlation, regression analysis and the application of these methods to the study of the variety of human characteristics\u2014height, weight, eyelash length among others. Pearson developed the Pearson product-moment correlation coefficient, defined as a product-moment, the method of moments for the fitting of distributions to samples and the Pearson distribution, among many other things. Galton and Pearson founded Biometrika as the first journal of mathematical statistics and biostatistics (then called biometry), and the latter founded the world's first university statistics department at University College London.Ronald Fisher coined the term null hypothesis during the Lady tasting tea experiment, which \"is never proved or established, but is possibly disproved, in the course of experimentation\".The second wave of the 1910s and 20s was initiated by William Sealy Gosset, and reached its culmination in the insights of Ronald Fisher, who wrote the textbooks that were to define the academic discipline in universities around the world. Fisher's most important publications were his 1918 seminal paper The Correlation between Relatives on the Supposition of Mendelian Inheritance, which was the first to use the statistical term, variance, his classic 1925 work Statistical Methods for Research Workers and his 1935 The Design of Experiments, where he developed rigorous design of experiments models. He originated the concepts of sufficiency, ancillary statistics, Fisher's linear discriminator and Fisher information. In his 1930 book The Genetical Theory of Natural Selection he applied statistics to various biological concepts such as Fisher's principle). Nevertheless, A.W.F. Edwards has remarked that it is \"probably the most celebrated argument in evolutionary biology\". (about the sex ratio), the Fisherian runaway, a concept in sexual selection about a positive feedback runaway affect found in evolution.\nThe final wave, which mainly saw the refinement and expansion of earlier developments, emerged from the collaborative work between Egon Pearson and Jerzy Neyman in the 1930s. They introduced the concepts of \"Type II\" error, power of a test and confidence intervals. Jerzy Neyman in 1934 showed that stratified random sampling was in general a better method of estimation than purposive (quota) sampling.Today, statistical methods are applied in all fields that involve decision making, for making accurate inferences from a collated body of data and for making decisions in the face of uncertainty based on statistical methodology. The use of modern computers has expedited large-scale statistical computations and has also made possible new methods that are impractical to perform manually. Statistics continues to be an area of active research for example on the problem of how to analyze Big data.\n\n\n== Statistical data ==\n\n\n=== Data collection ===\n\n\n==== Sampling ====\nWhen full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through statistical models. The idea of making inferences based on sampled data began around the mid-1600s in connection with estimating populations and developing precursors of life insurance.To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design for experiments that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.\nSampling theory is part of the mathematical discipline of probability theory. Probability is used in mathematical statistics to study the sampling distributions of sample statistics and, more generally, the properties of statistical procedures. The use of any statistical method is valid when the system or population under consideration satisfies the assumptions of the method.\nThe difference in point of view between classic probability theory and sampling theory is, roughly, that probability theory starts from the given parameters of a total population to deduce probabilities that pertain to samples. Statistical inference, however, moves in the opposite direction\u2014inductively inferring from samples to the parameters of a larger or total population.\n\n\n==== Experimental and observational studies ====\nA common goal for a statistical research project is to investigate causality, and in particular to draw a conclusion on the effect of changes in the values of predictors or independent variables on dependent variables. There are two major types of causal statistical studies: experimental studies and observational studies. In both types of studies, the effect of differences of an independent variable (or variables) on the behavior of the dependent variable are observed. The difference between the two types lies in how the study is actually conducted. Each can be very effective.\nAn experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation. Instead, data are gathered and correlations between predictors and response are investigated.\nWhile the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data\u2014like natural experiments and observational studies\u2014for which a statistician would use a modified, more structured estimation method (e.g., Difference in differences estimation and instrumental variables, among many others) that produce consistent estimators.\n\n\n===== Experiments =====\nThe basic steps of a statistical experiment are:\n\nPlanning the research, including finding the number of replicates of the study, using the following information:  preliminary estimates regarding the size of treatment effects, alternative hypotheses, and the estimated experimental variability. Consideration of the selection of experimental subjects and the ethics of research is necessary. Statisticians recommend that experiments compare (at least) one new treatment with a standard treatment or control, to allow an unbiased estimate of the difference in treatment effects.\nDesign of experiments, using blocking to reduce the influence of confounding variables, and  randomized assignment of treatments to subjects to allow unbiased estimates of treatment effects and experimental error. At this stage, the experimenters and statisticians write the experimental protocol that will guide the performance of the experiment and which specifies the primary analysis of the experimental data.\nPerforming the experiment following the experimental protocol and analyzing the data following the experimental protocol.\nFurther examining the data set in secondary analyses, to suggest new hypotheses for future study.\nDocumenting and presenting the results of the study.Experiments on human behavior have special concerns. The famous Hawthorne study examined changes to the working environment at the Hawthorne plant of the Western Electric Company. The researchers were interested in determining whether increased illumination would increase the productivity of the assembly line workers. The researchers first measured the productivity in the plant, then modified the illumination in an area of the plant and checked if the changes in illumination affected productivity. It turned out that productivity indeed improved (under the experimental conditions). However, the study is heavily criticized today for errors in experimental procedures, specifically for the lack of a control group and blindness. The Hawthorne effect refers to finding that an outcome (in this case, worker productivity) changed due to observation itself. Those in the Hawthorne study became more productive not because the lighting was changed but because they were being observed.\n\n\n===== Observational study =====\nAn example of an observational study is one that explores the association between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a cohort study, and then look for the number of cases of lung cancer in each group. A case-control study is another type of observational study in which people with and without the outcome of interest (e.g. lung cancer) are invited to participate and their exposure histories are collected.\n\n\n=== Types of data ===\n\nVarious attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one (injective) transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.\nBecause variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating point computation. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.\nOther categorizations have been proposed. For example, Mosteller and Tukey (1977) distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990) described continuous counts, continuous ratios, count ratios, and categorical modes of data. See also Chrisman (1998), van den Berg (1991).The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. \"The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer\" (Hand, 2004, p. 82).\n\n\n== Statistical methods ==\n\n\n=== Descriptive statistics ===\n\nA descriptive statistic (in the count noun sense) is a summary statistic that quantitatively describes or summarizes features of a collection of information, while descriptive statistics in the mass noun sense is the process of using and analyzing those statistics. Descriptive statistics is distinguished from inferential statistics (or inductive statistics), in that descriptive statistics aims to summarize a sample, rather than use the data to learn about the population that the sample of data is thought to represent.\n\n\n=== Inferential statistics ===\n\nStatistical inference is the process of using data analysis to deduce properties of an underlying probability distribution. Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates.  It is assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.\n\n\n==== Terminology and theory of inferential statistics ====\n\n\n===== Statistics, estimators and pivotal quantities =====\nConsider independent identically distributed (IID) random variables with a given probability distribution: standard statistical inference and estimation theory defines a random sample as the random vector given by the column vector of these IID variables. The population being examined is described by a probability distribution that may have unknown parameters.\nA statistic is a random variable that is a function of the random sample, but not a function of unknown parameters. The probability distribution of the statistic, though, may have unknown parameters.\nConsider now a function of the unknown parameter: an estimator is a statistic used to estimate such function. Commonly used estimators include sample mean, unbiased sample variance and sample covariance.\nA random variable that is a function of the random sample and of the unknown parameter, but whose probability distribution does not depend on the unknown parameter is called a pivotal quantity or pivot. Widely used pivots include the z-score, the chi square statistic and Student's t-value.\nBetween two estimators of a given parameter, the one with lower mean squared error is said to be more efficient. Furthermore, an estimator is said to be unbiased if its expected value is equal to the true value of the unknown parameter being estimated, and asymptotically unbiased if its expected value converges at the limit to the true value of such parameter.\nOther desirable properties for estimators include: UMVUE estimators that have the lowest variance for all possible values of the parameter to be estimated (this is usually an easier property to verify than efficiency) and consistent estimators which converges in probability to the true value of such parameter.\nThis still leaves the question of how to obtain estimators in a given situation and carry the computation, several methods have been proposed: the method of moments, the maximum likelihood method, the least squares method and the more recent method of estimating equations.\n\n\n===== Null hypothesis and alternative hypothesis =====\nInterpretation of statistical information can often involve the development of a null hypothesis which is usually (but not necessarily) that no relationship exists among variables or that no change occurred over time.The best illustration for a novice is the predicament encountered by a criminal trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative hypothesis, H1, asserts that the defendant is guilty. The indictment comes because of suspicion of the guilt. The H0 (status quo) stands in opposition to H1 and is maintained unless H1 is supported by evidence \"beyond a reasonable doubt\". However, \"failure to reject H0\" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0. While one can not \"prove\" a null hypothesis, one can test how close it is to being true with a power test, which tests for type II errors.\nWhat statisticians call an alternative hypothesis is simply a hypothesis that contradicts the null hypothesis.\n\n\n===== Error =====\nWorking from a null hypothesis, two basic forms of error are recognized:\n\nType I errors where the null hypothesis is falsely rejected giving a \"false positive\".\nType II errors where the null hypothesis fails to be rejected and an actual difference between populations is missed giving a \"false negative\".Standard deviation refers to the extent to which individual observations in a sample differ from a central value, such as the sample or population mean, while Standard error refers to an estimate of difference between sample mean and population mean.\nA statistical error is the amount by which an observation differs from its expected value, a residual is the amount an observation differs from the value the estimator of the expected value assumes on a given sample (also called prediction).\nMean squared error is used for obtaining efficient estimators, a widely used class of estimators. Root mean square error is simply the square root of mean squared error.\n\nMany statistical methods seek to minimize the residual sum of squares, and these are called \"methods of least squares\" in contrast to Least absolute deviations. The latter gives equal weight to small and big errors, while the former gives more weight to large errors. Residual sum of squares is also differentiable, which provides a handy property for doing regression. Least squares applied to linear regression is called ordinary least squares method and least squares applied to nonlinear regression is called non-linear least squares. Also in a linear regression model the non deterministic part of the model is called error term, disturbance or more simply noise. Both linear regression and non-linear regression are addressed in polynomial least squares, which also describes the variance in a prediction of the dependent variable (y axis) as a function of the independent variable (x axis) and the deviations (errors, noise, disturbances) from the estimated (fitted) curve.\nMeasurement processes that generate statistical data are also subject to error.  Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also be important. The presence of missing data or censoring may result in biased estimates and specific techniques have been developed to address these problems.\n\n\n===== Interval estimation =====\n\nMost studies only sample part of a population, so results don't fully represent the whole population. Any estimates obtained from the sample only approximate the population value. Confidence intervals allow statisticians to express how closely the sample estimate matches the true value in the whole population. Often they are expressed as 95% confidence intervals. Formally, a 95% confidence interval for a value is a range where, if the sampling and analysis were repeated under the same conditions (yielding a different dataset), the interval would include the true (population) value in 95% of all possible cases. This does not imply that the probability that the true value is in the confidence interval is 95%. From the frequentist perspective, such a claim does not even make sense, as the true value is not a random variable.  Either the true value is or is not within the given interval. However, it is true that, before any data are sampled and given a plan for how to construct the confidence interval, the probability is 95% that the yet-to-be-calculated interval will cover the true value: at this point, the limits of the interval are yet-to-be-observed random variables. One approach that does yield an interval that can be interpreted as having a given probability of containing the true value is to use a credible interval from Bayesian statistics: this approach depends on a different way of interpreting what is meant by \"probability\", that is as a Bayesian probability.\nIn principle confidence intervals can be symmetrical or asymmetrical. An interval can be asymmetrical because it works as lower or upper bound for a parameter (left-sided interval or right sided interval), but it can also be asymmetrical because the two sided interval is built violating symmetry around the estimate. Sometimes the bounds for a confidence interval are reached asymptotically and these are used to approximate the true bounds.\n\n\n===== Significance =====\n\nStatistics rarely give a simple Yes/No type answer to the question under analysis. Interpretation often comes down to the level of statistical significance applied to the numbers and often refers to the probability of a value accurately rejecting the null hypothesis (sometimes referred to as the p-value).\n\nThe standard approach is to test a null hypothesis against an alternative hypothesis. A critical region is the set of values of the estimator that leads to refuting the null hypothesis. The probability of type I error is therefore the probability that the estimator belongs to the critical region given that null hypothesis is true (statistical significance) and the probability of type II error is the probability that the estimator doesn't belong to the critical region given that the alternative hypothesis is true. The statistical power of a test is the probability that it correctly rejects the null hypothesis when the null hypothesis is false.\nReferring to statistical significance does not necessarily mean that the overall result is significant in real world terms. For example, in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect, such that the drug is unlikely to help the patient noticeably.\nAlthough in principle the acceptable level of statistical significance may be subject to debate, the p-value is the smallest significance level that allows the test to reject the null hypothesis. This test is logically equivalent to saying that the p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the test statistic. Therefore, the smaller the p-value, the lower the probability of committing type I error.\nSome problems are usually associated with this framework (See criticism of hypothesis testing):\n\nA difference that is highly statistically significant can still be of no practical significance, but it is possible to properly formulate tests to account for this. One response involves going beyond reporting only the significance level to include the p-value when reporting whether a hypothesis is rejected or accepted. The p-value, however, does not indicate the size or importance of the observed effect and can also seem to exaggerate the importance of minor differences in large studies. A better and increasingly common approach is to report confidence intervals. Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both the size of the effect and the uncertainty surrounding it.\nFallacy of the transposed conditional, aka prosecutor's fallacy: criticisms arise because the hypothesis testing approach forces one hypothesis (the null hypothesis) to be favored, since what is being evaluated is the probability of the observed result given the null hypothesis and not probability of the null hypothesis given the observed result. An alternative to this approach is offered by Bayesian inference, although it requires establishing a prior probability.\nRejecting the null hypothesis does not automatically prove the alternative hypothesis.\nAs everything in inferential statistics it relies on sample size, and therefore under fat tails p-values may be seriously mis-computed.\n\n\n===== Examples =====\nSome well-known statistical tests and procedures are:\n\n\n=== Exploratory data analysis ===\n\nExploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.\n\n\n== Misuse ==\n\nMisuse of statistics can produce subtle, but serious errors in description and interpretation\u2014subtle in the sense that even experienced professionals make such errors, and serious in the sense that they can lead to devastating decision errors. For instance, social policy, medical practice, and the reliability of structures like bridges all rely on the proper use of statistics.\nEven when statistical techniques are correctly applied, the results can be difficult to interpret for those lacking expertise. The statistical significance of a trend in the data\u2014which measures the extent to which a trend could be caused by random variation in the sample\u2014may or may not agree with an intuitive sense of its significance. The set of basic statistical skills (and skepticism) that people need to deal with information in their everyday lives properly is referred to as statistical literacy.\nThere is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter. A  mistrust and misunderstanding of statistics is associated with the quotation, \"There are three kinds of lies: lies, damned lies, and statistics\". Misuse of statistics can be both inadvertent and intentional, and the book How to Lie with Statistics outlines a range of considerations. In an attempt to shed light on the use and misuse of statistics, reviews of statistical techniques used in particular fields are conducted (e.g. Warne, Lazo, Ramos, and Ritter (2012)).Ways to avoid misuse of statistics include using proper diagrams and avoiding bias. Misuse can occur when conclusions are overgeneralized and claimed to be representative of more than they really are, often by either deliberately or unconsciously overlooking sampling bias. Bar graphs are arguably the easiest diagrams to use and understand, and they can be made either by hand or with simple computer programs. Unfortunately, most people do not look for bias or errors, so they are not noticed. Thus, people may often believe that something is true even if it is not well represented. To make data gathered from statistics believable and accurate, the sample taken must be representative of the whole.  According to Huff, \"The dependability of a sample can be destroyed by [bias]... allow yourself some degree of skepticism.\"To assist in the understanding of statistics Huff proposed a series of questions to be asked in each case:\nWho says so? (Does he/she have an axe to grind?)\nHow does he/she know? (Does he/she have the resources to know the facts?)\nWhat's missing? (Does he/she give us a complete picture?)\nDid someone change the subject? (Does he/she offer us the right answer to the wrong problem?)\nDoes it make sense? (Is his/her conclusion logical and consistent with what  we already know?)\n\n\n=== Misinterpretation: correlation ===\nThe concept of correlation is particularly noteworthy for the potential confusion it can cause. Statistical analysis of a data set often reveals that two variables (properties) of the population under consideration tend to vary together, as if they were connected. For example, a study of annual income that also looks at age of death might find that poor people tend to have shorter lives than affluent people. The two variables are said to be correlated; however, they may or may not be the cause of one another. The correlation phenomena could be caused by a third, previously unconsidered phenomenon, called a lurking variable or confounding variable. For this reason, there is no way to immediately infer the existence of a causal relationship between the two variables. (See Correlation does not imply causation.)\n\n\n== Applications ==\n\n\n=== Applied statistics, theoretical statistics and mathematical statistics ===\nApplied statistics comprises descriptive statistics and the application of inferential statistics. Theoretical statistics concerns the logical arguments underlying justification of approaches to statistical inference, as well as encompassing mathematical statistics. Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments.\n\n\n=== Machine learning and data mining ===\nMachine learning models are statistical and probabilistic models that capture patterns in the data through use of computational algorithms.\n\n\n=== Statistics in society ===\nStatistics is applicable to a wide variety of academic disciplines, including natural and social sciences, government, and business. Statistical consultants can help organizations and companies that don't have in-house expertise relevant to their particular questions.\n\n\n=== Statistical computing ===\n\nThe rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. Early statistical models were almost always from the class of linear models, but powerful computers, coupled with suitable numerical algorithms, caused an increased interest in nonlinear models (such as neural networks) as well as the creation of new types, such as generalized linear models and multilevel models.\nIncreased computing power has also led to the growing popularity of computationally intensive methods based on resampling, such as permutation tests and the bootstrap, while techniques such as Gibbs sampling have made use of Bayesian models more feasible. The computer revolution has implications for the future of statistics with new emphasis on \"experimental\" and \"empirical\" statistics. A large number of both general and special purpose statistical software are now available. Examples of available software capable of complex statistical computation include programs such as Mathematica,  SAS, SPSS, and  R.\n\n\n=== Statistics applied to mathematics or the arts ===\nTraditionally, statistics was concerned with drawing inferences using a semi-standardized methodology that was \"required learning\" in most sciences. This tradition has changed with use of statistics in non-inferential contexts. What was once considered a dry subject, taken in many fields as a degree-requirement, is now viewed enthusiastically. Initially derided by some mathematical purists, it is now considered essential methodology in certain areas.\n\nIn number theory, scatter plots of data generated by a distribution function may be transformed with familiar tools used in statistics to reveal underlying patterns, which may then lead to hypotheses.\nMethods of statistics including predictive methods in forecasting are combined with chaos theory and fractal geometry to create video works that are considered to have great beauty.\nThe process art of Jackson Pollock relied on artistic experiments whereby underlying distributions in nature were artistically revealed. With the advent of computers, statistical methods were applied to formalize such distribution-driven natural processes to make and analyze moving video art.\nMethods of statistics may be used predicatively in performance art, as in a card trick based on a Markov process that only works some of the time, the occasion of which can be predicted using statistical methodology.\nStatistics can be used to predicatively create art, as in the statistical or stochastic music invented by Iannis Xenakis, where the music is performance-specific. Though this type of artistry does not always come out as expected, it does behave in ways that are predictable and tunable using statistics.\n\n\n== Specialized disciplines ==\n\nStatistical techniques are used in a wide range of types of scientific and social research, including: biostatistics,  computational biology,  computational sociology,  network biology,  social science,  sociology and  social research. Some fields of inquiry use applied statistics so extensively that they have specialized terminology. These disciplines include:\n\nIn addition, there are particular types of statistical analysis that have also developed their own specialised terminology and methodology:\n\nStatistics form a key basis tool in business and manufacturing as well. It is used to understand measurement systems variability, control processes (as in statistical process control or SPC), for summarizing data, and to make data-driven decisions. In these roles, it is a key tool, and perhaps the only reliable tool.\n\n\n== See also ==\n\nFoundations and major areas of statistics\n\n\n== References ==\n\n\n== Further reading ==\nLydia Denworth, \"A Significant Problem: Standard scientific methods are under fire. Will anything change?\", Scientific American, vol. 321, no. 4 (October 2019), pp. 62\u201367. \"The use of p values for nearly a century [since 1925] to determine statistical significance of experimental results has contributed to an illusion of certainty and [to] reproducibility crises in many scientific fields. There is growing determination to reform statistical analysis... Some [researchers] suggest changing statistical methods, whereas others would do away with a threshold for defining \"significant\" results.\" (p. 63.)\nBarbara Illowsky; Susan Dean (2014). Introductory Statistics. OpenStax CNX. ISBN 9781938168208.\nDavid W. Stockburger, Introductory Statistics: Concepts, Models, and Applications, 3rd Web Ed.  Missouri State University.\nOpenIntro Statistics, 3rd edition by Diez, Barr, and Cetinkaya-Rundel\nStephen Jones, 2010. Statistics in Psychology: Explanations without Equations. Palgrave Macmillan. ISBN 9781137282392.\nCohen, J. (1990). \"Things I have learned (so far)\". American Psychologist, 45, 1304\u20131312.\nGigerenzer, G. (2004). \"Mindless statistics\". Journal of Socio-Economics, 33, 587\u2013606. doi:10.1016/j.socec.2004.09.033\nIoannidis, J.P.A. (2005). \"Why most published research findings are false\". PLoS Medicine, 2, 696\u2013701. doi:10.1371/journal.pmed.0040168\n\n\n== External links ==\n\n(Electronic Version): StatSoft, Inc. (2013). Electronic Statistics Textbook. Tulsa, OK: StatSoft.\nOnline Statistics Education: An Interactive Multimedia Course of Study. Developed by Rice University (Lead Developer), University of Houston Clear Lake, Tufts University, and National Science Foundation.\nUCLA Statistical Computing Resources\nPhilosophy of Statistics from the Stanford Encyclopedia of Philosophy"}, {"job": "data scientist", "skill": "data visualization", "keywords": "data visualization", "description": "Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images. This communication is achieved through the use of a systematic mapping between graphic marks and data values in the creation of the visualization. This mapping establishes how data values will be represented visually, determining how and to what extent a property of a graphic mark, such as size or color, will change to reflect change in the value of a datum.\nTo communicate information clearly and efficiently, data visualization uses statistical graphics, plots, information graphics and other tools. Numerical data may be encoded using dots, lines, or bars, to visually communicate a quantitative message. Effective visualization helps users analyze and reason about data and evidence. It makes complex data more accessible, understandable and usable. Users may have particular analytical tasks, such as making comparisons or understanding causality, and the design principle of the graphic (i.e., showing comparisons or showing causality) follows the task. Tables are generally used where users will look up a specific measurement, while charts of various types are used to show patterns or relationships in the data for one or more variables.\nData visualization is both an art and a science. It is viewed as a branch of descriptive statistics by some, but also as a grounded theory development tool by others. Increased amounts of data created by Internet activity and an expanding number of sensors in the environment are referred to as \"big data\" or Internet of things. Processing, analyzing and communicating this data present ethical and analytical challenges for data visualization. The field of data science and practitioners called data scientists help address this challenge.\n\n\n== Overview ==\n\nData visualization refers to the techniques used to communicate data or information by encoding it as visual objects (e.g., points, lines or bars) contained in graphics. The goal is to communicate information clearly and efficiently to users. It is one of the steps in data analysis or data science. According to Friedman (2008) the \"main goal of data visualization is to communicate information clearly and effectively through graphical means. It doesn't mean that data visualization needs to look boring to be functional or extremely sophisticated to look beautiful. To convey ideas effectively, both aesthetic form and functionality need to go hand in hand, providing insights into a rather sparse and complex data set by communicating its key-aspects in a more intuitive way. Yet designers often fail to achieve a balance between form and function, creating gorgeous data visualizations which fail to serve their main purpose \u2014 to communicate information\".Indeed, Fernanda Viegas and Martin M. Wattenberg suggested that an ideal visualization should not only communicate clearly, but stimulate viewer engagement and attention.Data visualization is closely related to information graphics, information visualization, scientific visualization, exploratory data analysis and statistical graphics. In the new millennium, data visualization has become an active area of research, teaching and development. According to Post et al. (2002), it has united scientific and information visualization.\n\n\n== Characteristics of effective graphical displays ==\n\nProfessor Edward Tufte explained that users of information displays are executing particular analytical tasks  such as making comparisons. The design principle of the information graphic should support the analytical task. As William Cleveland and Robert McGill show, different graphical elements accomplish this more or less effectively. For example, dot plots and bar charts outperform pie charts.In his 1983 book The Visual Display of Quantitative Information, Edward Tufte defines 'graphical displays' and principles for effective graphical display in the following passage:\n\"Excellence in statistical graphics consists of complex ideas communicated with clarity, precision and efficiency. Graphical displays should:\n\nshow the data\ninduce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production or something else\navoid distorting what the data has to say\npresent many numbers in a small space\nmake large data sets coherent\nencourage the eye to compare different pieces of data\nreveal the data at several levels of detail, from a broad overview to the fine structure\nserve a reasonably clear purpose: description, exploration, tabulation or decoration\nbe closely integrated with the statistical and verbal descriptions of a data set.Graphics reveal data.  Indeed graphics can be more precise and revealing than conventional statistical computations.\"For example, the Minard diagram shows the losses suffered by Napoleon's army in the 1812\u20131813 period. Six variables are plotted: the size of the army, its location on a two-dimensional surface (x and y), time, direction of movement, and temperature. The line width illustrates a comparison (size of the army at points in time) while the temperature axis suggests a cause of the change in army size. This multivariate display on a two dimensional surface tells a story that can be grasped immediately while identifying the source data to build credibility.  Tufte wrote in 1983 that: \"It may well be the best statistical graphic ever drawn.\"Not applying these principles may result in misleading graphs, which distort the message or support an erroneous conclusion. According to Tufte, chartjunk refers to extraneous interior decoration of the graphic that does not enhance the message, or gratuitous three dimensional or perspective effects. Needlessly separating the explanatory key from the image itself, requiring the eye to travel back and forth from the image to the key, is a form of \"administrative debris.\" The ratio of \"data to ink\" should be maximized, erasing non-data ink where feasible.The Congressional Budget Office summarized several best practices for graphical displays in a June 2014 presentation.  These included: a) Knowing your audience; b) Designing graphics that can stand alone outside the context of the report; and c) Designing graphics that communicate the key messages in the report.\n\n\n== Quantitative messages ==\n\nAuthor Stephen Few described eight types of quantitative messages that users may attempt to understand or communicate from a set of data and the associated graphs used to help communicate the message:\n\nTime-series: A single variable is captured over a period of time, such as the unemployment rate over a 10-year period. A line chart may be used to demonstrate the trend.\nRanking: Categorical subdivisions are ranked in ascending or descending order, such as a ranking of sales performance (the measure) by sales persons (the category, with each sales person a categorical subdivision) during a single period.  A bar chart may be used to show the comparison across the sales persons.\nPart-to-whole: Categorical subdivisions are measured as a ratio to the whole (i.e., a percentage out of 100%).  A pie chart or bar chart can show the comparison of ratios, such as the market share represented by competitors in a market.\nDeviation: Categorical subdivisions are compared against a reference, such as a comparison of actual vs. budget expenses for several departments of a business for a given time period.  A bar chart can show comparison of the actual versus the reference amount.\nFrequency distribution: Shows the number of observations of a particular variable for given interval, such as the number of years in which the stock market return is between intervals such as 0-10%, 11-20%, etc. A histogram, a type of bar chart, may be used for this analysis. A boxplot helps visualize key statistics about the distribution, such as median, quartiles, outliers, etc.\nCorrelation: Comparison between observations represented by two variables (X,Y) to determine if they tend to move in the same or opposite directions. For example, plotting unemployment (X) and inflation (Y) for a sample of months. A scatter plot is typically used for this message.\nNominal comparison: Comparing categorical subdivisions in no particular order, such as the sales volume by product code. A bar chart may be used for this comparison.\nGeographic or geospatial: Comparison of a variable across a map or layout, such as the unemployment rate by state or the number of persons on the various floors of a building. A cartogram is a typical graphic used.Analysts reviewing a set of data may consider whether some or all of the messages and graphic types above are applicable to their task and audience. The process of trial and error to identify meaningful relationships and messages in the data is part of exploratory data analysis.\n\n\n== Visual perception and data visualization ==\nA human can distinguish differences in line length, shape, orientation, and color (hue) readily without significant processing effort; these are referred to as \"pre-attentive attributes\".  For example, it may require significant time and effort (\"attentive processing\") to identify the number of times the digit \"5\" appears in a series of numbers; but if that digit is different in size, orientation, or color, instances of the digit can be noted quickly through pre-attentive processing.Effective graphics take advantage of pre-attentive processing and attributes and the relative strength of these attributes. For example, since humans can more easily process differences in line length than surface area, it may be more effective to use a bar chart (which takes advantage of line length to show comparison) rather than pie charts (which use surface area to show comparison).\n\n\n=== Human perception/cognition and data visualization ===\nAlmost all data visualizations are created for human consumption. Knowledge of human perception and cognition is necessary when designing intuitive visualizations. Cognition refers to processes in human beings like perception, attention, learning, memory, thought, concept formation, reading, and problem solving. Human visual processing is efficient in detecting changes and making comparisons between quantities, sizes, shapes and variations in lightness. When properties of symbolic data are mapped to visual properties, humans can browse through large amounts of data efficiently. It is estimated that 2/3 of the brain's neurons can be involved in visual processing. Proper visualization provides a different approach to show potential connections, relationships, etc. which are not as obvious in non-visualized quantitative data. Visualization can become a means of data exploration.\n\n\n== History of data visualization ==\nThere is no comprehensive 'history' of data visualization. There are no accounts that span the entire development of visual thinking and the visual\nrepresentation of data, and which collate the contributions of disparate disciplines. Michael Friendly and Daniel J Denis of York University are engaged in a project that attempts to provide a comprehensive history of visualization. Contrary to general belief, data visualization is not a modern development. Stellar data, or information such as location of stars were visualized on the walls of caves (such as those found in Lascaux Cave in Southern France) since the Pleistocene era. Physical artefacts such as Mesopotamian clay tokens (5500 BC), Inca quipus (2600 BC) and Marshall Islands stick charts (n.d.) can also be considered as visualizing quantitative information.First documented data visualization can be tracked back to 1160 B.C. with Turin Papyrus Map which accurately illustrates the distribution of geological resources and provides information about quarrying of those resources. Such maps can be categorized as Thematic Cartography, which is a type of data visualization that presents and communicates specific data and information through a geographical illustration designed to show a particular theme connected with a specific geographic area. Earliest documented forms of data visualization were various thematic maps from different cultures and ideograms and hieroglyphs that provided and allowed interpretation of information illustrated. For example, Linear B tablets of Mycenae provided a visualization of information regarding Late Bronze Age era trades in the Mediterranean. The idea of coordinates was used by ancient Egyptian surveyors in laying out towns, earthly and heavenly positions were located by something akin to latitude and longitude at least by 200 BC, and the map projection of a spherical earth into latitude and longitude by Claudius Ptolemy [c.85\u2013c. 165] in Alexandria would serve as reference standards until the 14th century.\nInvention of paper and parchment allowed further development of visualizations throughout history. Figure shows a graph from the 10th, possibly 11th century that is intended to be an illustration of the planetary movement, used in an appendix of a textbook in monastery schools. The graph apparently was meant to represent a plot of the inclinations of the planetary orbits as a function of the time. For this purpose the zone of the zodiac was represented on a plane with a horizontal line divided into thirty parts as the time or longitudinal axis. The vertical axis designates the width of the zodiac. The horizontal scale appears to have been chosen for each planet individually for the periods cannot be reconciled. The accompanying text refers only to the amplitudes. The curves are apparently not related in time. \nBy the 16th century, techniques and instruments for precise observation and measurement of physical quantities, and geographic and celestial position were well-developed (for example, a \u201cwall quadrant\u201d constructed by Tycho Brahe [1546\u20131601], covering an entire wall in his observatory). Particularly important were the development of triangulation and other methods to determine mapping locations accurately.\nFrench philosopher and mathematician Ren\u00e9 Descartes and Pierre de Fermat developed analytic geometry and two-dimensional coordinate system which heavily influenced the practical methods of displaying and calculating values. Fermat and Blaise Pascal's work on statistics and probability theory laid the groundwork for what we now conceptualize as data. According to the Interaction Design Foundation, these developments allowed and helped William Playfair, who saw potential for graphical communication of quantitative data, to generate and develop graphical methods of statistics.  In the second half of the 20th century, Jacques Bertin used quantitative graphs to represent information \"intuitively, clearly, accurately, and efficiently\".John Tukey and Edward Tufte pushed the bounds of data visualization; Tukey with his new statistical approach of exploratory data analysis and Tufte with his book \"The Visual Display of Quantitative Information\" paved the way for refining data visualization techniques for more than statisticians. With the progression of technology came the progression of data visualization; starting with hand drawn visualizations and evolving into more technical applications \u2013 including interactive designs leading to software visualization.Programs like SAS, SOFA, R, Minitab, Cornerstone and more allow for data visualization in the field of statistics. Other data visualization applications, more focused and unique to individuals, programming languages such as D3, Python and JavaScript help to make the visualization of quantitative data a possibility.  Private schools have also developed programs to meet the demand for learning data visualization and associated programming libraries, including free programs like The Data Incubator or paid programs like General Assembly.\nBeginning with the Symposium \"Data to Discovery\"  in 2013, ArtCenter College of Design, Caltech and JPL in Pasadena have run an annual program on Interactive Data Visualization.  The program asks: How can interactive data visualization help scientists and engineers explore their data more effectively? How can computing, design, and design thinking help maximize research results? What methodologies are most effective for leveraging knowledge from these fields?  By encoding relational information with appropriate visual and interactive characteristics to help interrogate, and ultimately gain new insight into data, the program develops new interdisciplinary approaches to complex science problems, leveraging design thinking and the latest methods from computing, User-Centered Design, interaction design and 3D graphics. \n\n\n== Terminology ==\nData visualization involves specific terminology, some of which is derived from statistics. For example, author Stephen Few defines two types of data, which are used in combination to support a meaningful analysis or visualization:\n\nCategorical: Text labels describing the nature of the data, such as \"Name\" or \"Age\".  This term also covers qualitative (non-numerical) data.\nQuantitative: Numerical measures, such as \"25\" to represent the age in years.Two primary types of information displays are tables and graphs.\n\nA table contains quantitative data organized into rows and columns with categorical labels. It is primarily used to look up specific values. In the example above, the table might have categorical column labels representing the name (a qualitative variable)  and age (a quantitative variable), with each row of data representing one person (the sampled experimental unit or category subdivision).\nA graph is primarily used to show relationships among data and portrays values encoded as visual objects (e.g., lines, bars, or points). Numerical values are displayed within an area delineated by one or more axes.  These axes provide scales (quantitative and categorical) used to label and assign values to the visual objects. Many graphs are also referred to as charts.Eppler and Lengler have developed the \"Periodic Table of Visualization Methods,\" an interactive chart displaying various data visualization methods. It includes six types of data visualization methods: data, information, concept, strategy, metaphor and compound.\n\n\n== Examples of diagrams used for data visualization ==\n\n\n== Other perspectives ==\nThere are different approaches on the scope of data visualization. One common focus is on information presentation, such as Friedman (2008). Friendly (2008) presumes two main parts of data visualization: statistical graphics, and thematic cartography. In this line the \"Data Visualization: Modern Approaches\" (2007) article gives an overview of seven subjects of data visualization:\nArticles & resources\nDisplaying connections\nDisplaying data\nDisplaying news\nDisplaying websites\nMind maps\nTools and servicesAll these subjects are closely related to graphic design and information representation.\nOn the other hand, from a computer science perspective, Frits H. Post in 2002 categorized the field into sub-fields:\nInformation visualization\nInteraction techniques and architectures\nModelling techniques\nMultiresolution methods\nVisualization algorithms and techniques\nVolume visualization\n\n\n== Data presentation architecture ==\n\nData presentation architecture (DPA) is a skill-set that seeks to identify, locate, manipulate, format and present data in such a way as to optimally communicate meaning and proper knowledge.\nHistorically, the term data presentation architecture is attributed to Kelly Lautt: \"Data Presentation Architecture (DPA) is a rarely applied skill set critical for the success and value of Business Intelligence. Data presentation architecture weds the science of numbers, data and statistics in discovering valuable information from data and making it usable, relevant and actionable with the arts of data visualization, communications, organizational psychology and change management in order to provide business intelligence solutions with the data scope, delivery timing, format and visualizations that will most effectively support and drive operational, tactical and strategic behaviour toward understood business (or organizational) goals. DPA is neither an IT nor a business skill set but exists as a separate field of expertise. Often confused with data visualization, data presentation architecture is a much broader skill set that includes determining what data on what schedule and in what exact format is to be presented, not just the best way to present data that has already been chosen. Data visualization skills are one element of DPA.\"\n\n\n=== Objectives ===\nDPA has two main objectives:\n\nTo use data to provide knowledge in the most efficient manner possible (minimize noise, complexity, and unnecessary data or detail given each audience's needs and roles)\nTo use data to provide knowledge in the most effective manner possible  (provide relevant, timely and complete data to each audience member in a clear and understandable manner that conveys important meaning, is actionable and can affect understanding, behavior and decisions)\n\n\n=== Scope ===\nWith the above objectives in mind, the actual work of data presentation architecture consists of:\n\nCreating effective delivery mechanisms for each audience member depending on their role, tasks, locations and access to technology\nDefining important meaning (relevant knowledge) that is needed by each audience member in each context\nDetermining the required periodicity of data updates (the currency of the data)\nDetermining the right timing for data presentation (when and how often the user needs to see the data)\nFinding the right data (subject area, historical reach, breadth, level of detail, etc.)\nUtilizing appropriate analysis, grouping, visualization, and other presentation formats\n\n\n=== Related fields ===\nDPA work shares commonalities with several other fields, including:\n\nBusiness analysis in determining business goals, collecting requirements, mapping processes.\nBusiness process improvement in that its goal is to improve and streamline actions and decisions in furtherance of business goals\nData visualization in that it uses well-established theories of visualization to add or highlight meaning or importance in data presentation.\nInformation architecture, but information architecture's focus is on unstructured data and therefore excludes both analysis (in the statistical/data sense) and direct transformation of the actual content (data, for DPA) into new entities and combinations.\nHCI and interaction design, since the many of the principles in how to design interactive data visualisation have been developed cross-disciplinary with HCI.\nVisual journalism and data-driven journalism or data journalism: Visual journalism is concerned with all types of graphic facilitation of the telling of news stories, and data-driven and data journalism are not necessarily told with data visualisation. Nevertheless, the field of journalism are at the forefront in developing new data visualisations to communicate data.\nGraphic design, conveying information through styling, typography, position, and other aesthetic concerns.\n\n\n== See also ==\n\n\n== Notes ==\n\n\n== References ==\n\n\n== Further reading ==\nCleveland, William S. (1993). Visualizing Data. Hobart Press. ISBN 0-9634884-0-6.\nEvergreen, Stephanie (2016). Effective Data Visualization: The Right Chart for the Right Data. Sage. ISBN 978-1-5063-0305-5.\nHealy, Kieran (2019). Data Visualization: A Practical Introduction. Princeton: Princeton University Press. ISBN 978-0-691-18161-5.\nPost, Frits H.; Nielson, Gregory M.; Bonneau, Georges-Pierre (2003). Data Visualization: The State of the Art. New York: Springer. ISBN 978-1-4613-5430-7.\nWilke, Claus O. (2018). Fundamentals of Data Visualization. O'Reilly. ISBN 978-1-4920-3108-6.\nWilkinson, Leland (2012). Grammar of Graphics. New York: Springer. ISBN 978-1-4419-2033-1.\n\n\n== External links ==\nMilestones in the History of Thematic Cartography, Statistical Graphics, and Data Visualization, An illustrated chronology of innovations by Michael Friendly and Daniel J. Denis.\nDuke University-Christa Kelleher Presentation-Communicating through infographics-visualizing scientific & engineering information-March 6, 2015"}, {"job": "data scientist", "skill": "big data", "keywords": "big data", "description": "Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. Data with many cases (rows) offer greater statistical power, while data with higher complexity (more attributes or columns) may lead to a higher false discovery rate. Big data challenges include capturing data, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy and data source. Big data was originally associated with three key concepts: volume, variety, and velocity. When we handle big data, we may not sample but simply observe and track what happens. Therefore, big data often includes data with sizes that exceed the capacity of traditional software to process within an acceptable time and value.\nCurrent usage of the term big data tends to refer to the use of predictive analytics, user behavior analytics, or certain other advanced data analytics methods that extract value from data, and seldom to a particular size of data set. \"There is little doubt that the quantities of data now available are indeed large, but that's not the most relevant characteristic of this new data ecosystem.\"\nAnalysis of data sets can find new correlations to \"spot business trends, prevent diseases, combat crime and so on.\" Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet searches, fintech, urban informatics, and business informatics.  Scientists encounter limitations in e-Science work, including meteorology, genomics, connectomics, complex physics simulations, biology and environmental research.Data sets grow rapidly, to a certain extent because they are increasingly gathered by cheap and numerous information-sensing Internet of things devices such as mobile devices, aerial (remote sensing), software logs, cameras, microphones, radio-frequency identification (RFID) readers and wireless sensor networks. The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes (2.5\u00d71018) of data are generated. Based on an IDC report prediction, the global data volume will grow exponentially from 4.4 zettabytes to 44 zettabytes between 2013 and 2020. By 2025, IDC predicts there will be 163 zettabytes of data. One question for large enterprises is determining who should own big-data initiatives that affect the entire organization.Relational database management systems, desktop statistics and software packages used to visualize data often have difficulty handling big data. The work may require \"massively parallel software running on tens, hundreds, or even thousands of servers\". What qualifies as being \"big data\" varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. \"For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.\"\n\n\n== Definition ==\nThe term has been in use since the 1990s, with some giving credit to John Mashey for popularizing the term.\nBig data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time. Big data philosophy encompasses unstructured, semi-structured and structured data, however the main focus is on unstructured data. Big data \"size\" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many zettabytes of data.\nBig data requires a set of techniques and technologies with new forms of integration to reveal insights from data-sets that are diverse, complex, and of a massive scale.\"Variety\", \"veracity\" and various other \"Vs\" are added by some organizations to describe it, a revision challenged by some industry authorities.A 2018 definition states \"Big data is where parallel computing tools are needed to handle data\", and notes, \"This represents a distinct and clearly defined change in the computer science used, via parallel programming theories, and losses of\nsome of the guarantees and capabilities made by Codd's relational model.\" The growing maturity of the concept more starkly delineates the difference between \"big data\" and \"Business Intelligence\":\nBusiness Intelligence uses applied mathematics tools and descriptive statistics with data with high information density to measure things, detect trends, etc.\nBig data uses mathematical analysis, optimization,  inductive statistics and concepts from nonlinear system identification  to infer laws (regressions, nonlinear relationships, and causal effects) from large sets of data with low information density to reveal relationships and dependencies, or to perform predictions of outcomes and behaviors.\n\n\n== Characteristics ==\n\nBig data can be described by the following characteristics:\n\nVolume\nThe quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.Variety\nThe type and nature of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from text, images, audio, video; plus it completes missing pieces through data fusion.Velocity\nThe speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available in real-time. Compared to small data, big data are produced more continually. Two kinds of velocity related to big data are the frequency of generation and the frequency of handling, recording, and publishing.Veracity\nIt is the extended definition for big data, which refers to the data quality and the data value. The data quality of captured data can vary greatly, affecting the accurate analysis.Data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information. For example, to manage a factory, one must consider both visible and invisible issues with various components. Information generation algorithms must detect and address invisible issues such as machine degradation, component wear, etc. on the factory floor.Other important characteristics of Big Data are:\nExhaustive\nWhether the entire system (i.e.,  \n  \n    \n      \n        n\n      \n    \n    {\\textstyle n}\n  =all) is captured or recorded or not.Fine-grained and uniquely lexical\nRespectively, the proportion of specific data of each element per element collected and if the element and its characteristics are properly indexed or identified.Relational\nIf the data collected contains commons fields that would enable a conjoining, or meta-analysis, of different data sets.Extensional\nIf new fields in each element of the data collected can be added or changed easily.Scalability\nIf the size of the data can expand rapidly.Value\nThe utility that can be extracted from the data.Variability\nIt refers to data whose value or other characteristics are shifting in relation to the context they are being generated.\n\n\n== Architecture ==\nBig data repositories have existed in many forms, often built by corporations with a special need.  Commercial vendors historically offered parallel database management systems for big data beginning in the 1990s.  For many years, WinterCorp published the largest database report.Teradata Corporation in 1984 marketed the parallel processing DBC 1012 system. Teradata systems were the first to store and analyze 1 terabyte of data in 1992. Hard disk drives were 2.5 GB in 1991 so the definition of big data continuously evolves according to Kryder's Law. Teradata installed the first petabyte class RDBMS based system in 2007. As of 2017, there are a few dozen petabyte class Teradata relational databases installed, the largest of which exceeds 50 PB. Systems up until 2008 were 100% structured relational data.  Since then, Teradata has added unstructured data types including XML, JSON, and Avro.\nIn 2000, Seisint Inc. (now LexisNexis Risk Solutions) developed a C++-based distributed platform for data processing and querying known as the HPCC Systems platform. This system automatically partitions, distributes, stores and delivers structured, semi-structured, and unstructured data across multiple commodity servers.  Users can write data processing pipelines and queries in a declarative dataflow programming language called ECL. Data analysts working in ECL are not required to define data schemas upfront and can rather focus on the particular problem at hand, reshaping data in the best possible manner as they develop the solution. In 2004, LexisNexis acquired Seisint Inc. and their high-speed parallel processing platform and successfully utilized this platform to integrate the data systems of Choicepoint Inc. when they acquired that company in 2008. In 2011, the HPCC systems platform was open-sourced under the Apache v2.0 License.\nCERN and other physics experiments have collected big data sets for many decades, usually analyzed via high-performance computing (supercomputers) rather than the commodity map-reduce architectures usually meant by the current \"big data\" movement.\nIn 2004, Google published a paper on a process called MapReduce that uses a similar architecture. The MapReduce concept provides a parallel processing model, and an associated implementation was released to process huge amounts of data.  With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was very successful, so others wanted to replicate the algorithm. Therefore, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop. Apache Spark was developed in 2012 in response to limitations in the MapReduce paradigm, as it adds the ability to set up many operations (not just map followed by reducing).\nMIKE2.0 is an open approach to information management that acknowledges the need for revisions due to big data implications identified in an article titled \"Big Data Solution Offering\". The methodology addresses handling big data in terms of useful permutations of data sources, complexity in interrelationships, and difficulty in deleting (or modifying) individual records.2012 studies showed that a multiple-layer architecture is one option to address the issues that big data presents. A distributed parallel architecture distributes data across multiple servers; these parallel execution environments can dramatically improve data processing speeds. This type of architecture inserts data into a parallel DBMS, which implements the use of MapReduce and Hadoop frameworks. This type of framework looks to make the processing power transparent to the end-user by using a front-end application server.The data lake allows an organization to shift its focus from centralized control to a shared model to respond to the changing dynamics of information management. This enables quick segregation of data into the data lake, thereby reducing the overhead time.Big data analytics for manufacturing applications is marketed as a \"5C architecture\" (connection, conversion, cyber, cognition, and configuration).\nFactory work and Cyber-physical systems may have an extended \"6C system\":\n\nConnection (sensor and networks)\nCloud (computing and data on demand)\nCyber (model and memory)\nContent/context (meaning and correlation)\nCommunity (sharing and collaboration)\nCustomization (personalization and value)\n\n\n== Technologies ==\nA 2011 McKinsey Global Institute report characterizes the main components and ecosystem of big data as follows:\nTechniques for analyzing data, such as A/B testing, machine learning and natural language processing\nBig data technologies, like business intelligence, cloud computing and databases\nVisualization, such as charts, graphs and other displays of the dataMultidimensional big data can also be represented as OLAP data cubes or, mathematically, tensors. Array Database Systems have set out to provide storage and high-level query support on this data type.\nAdditional technologies being applied to big data include efficient tensor-based computation, such as multilinear subspace learning., massively parallel-processing (MPP) databases, search-based applications, data mining, distributed file systems, distributed cache (e.g., burst buffer and Memcached), distributed databases, cloud and  HPC-based infrastructure (applications, storage and computing resources) and the Internet. Although, many approaches and technologies have been developed, it still remains difficult to carry out machine learning with big data.Some MPP relational databases have the ability to store and manage petabytes of data. Implicit is the ability to load, monitor, back up, and optimize the use of the large data tables in the RDBMS.DARPA's Topological Data Analysis program seeks the fundamental structure of massive data sets and in 2008 the technology went public with the launch of a company called Ayasdi.The practitioners of big data analytics processes are generally hostile to slower shared storage, preferring direct-attached storage (DAS) in its various forms from solid state drive (SSD) to high capacity SATA disk buried inside parallel processing nodes. The perception of shared storage architectures\u2014Storage area network (SAN) and Network-attached storage (NAS) \u2014is that they are relatively slow, complex, and expensive. These qualities are not consistent with big data analytics systems that thrive on system performance, commodity infrastructure, and low cost.\nReal or near-real time information delivery is one of the defining characteristics of big data analytics. Latency is therefore avoided whenever and wherever possible. Data in direct-attached memory or disk is good\u2014data on memory or disk at the other end of a FC SAN connection is not. The cost of a SAN at the scale needed for analytics applications is very much higher than other storage techniques.\nThere are advantages as well as disadvantages to shared storage in big data analytics, but big data analytics practitioners as of 2011 did not favour it.\n\n\n== Applications ==\n\nBig data has increased the demand of information management specialists so much so that Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, HP and Dell have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.Developed economies increasingly use data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide, and between 1 billion and 2 billion people accessing the internet. Between 1990 and 2005, more than 1 billion people worldwide entered the middle class, which means more people became more literate, which in turn led to information growth. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, 65 exabytes in 2007 and predictions put the amount of internet traffic at 667 exabytes annually by 2014. According to one estimate, one-third of the globally stored information is in the form of alphanumeric text and still image data, which is the format most useful for most big data applications. This also shows the potential of yet unused data (i.e. in the form of video and audio content).\nWhile many vendors offer off-the-shelf solutions for big data, experts recommend the development of in-house solutions custom-tailored to solve the company's problem at hand if the company has sufficient technical capabilities.\n\n\n=== Government ===\nThe use and adoption of big data within governmental processes allows efficiencies in terms of cost, productivity, and innovation, but does not come without its flaws. Data analysis often requires multiple parts of government (central and local) to work in collaboration and create new and innovative processes to deliver the desired outcome.\nCRVS (Civil Registration and Vital Statistics) collects all certificates status from birth to death. CRVS is a source of big data for governments.\n\n\n=== International development ===\nResearch on the effective usage of information and communication technologies for development (also known as ICT4D) suggests that big data technology can make important contributions but also present unique challenges to International development. Advancements in big data analysis offer cost-effective opportunities to improve decision-making in critical development areas such as health care, employment, economic productivity, crime, security, and natural disaster and resource management. Additionally, user-generated data offers new opportunities to give the unheard a voice. However, longstanding challenges for developing regions such as inadequate technological infrastructure and economic and human resource scarcity exacerbate existing concerns with big data such as privacy, imperfect methodology, and interoperability issues.\n\n\n=== Manufacturing ===\nBased on TCS 2013 Global Trend Study, improvements in supply planning and product quality provide the greatest benefit of big data for manufacturing. Big data provides an infrastructure for transparency in the manufacturing industry, which is the ability to unravel uncertainties such as inconsistent component performance and availability. Predictive manufacturing as an applicable approach toward near-zero downtime and transparency requires a vast amount of data and advanced prediction tools for a systematic process of data into useful information. A conceptual framework of predictive manufacturing begins with data acquisition where different type of sensory data is available to acquire such as acoustics, vibration, pressure, current, voltage and controller data. A vast amount of sensory data in addition to historical data construct the big data in manufacturing. The generated big data acts as the input into predictive tools and preventive strategies such as Prognostics and Health Management (PHM).\n\n\n=== Healthcare ===\nBig data analytics has helped healthcare improve by providing personalized medicine and prescriptive analytics, clinical risk intervention and predictive analytics, waste and care variability reduction, automated external and internal reporting of patient data, standardized medical terms and patient registries and fragmented point solutions. Some areas of improvement are more aspirational than actually implemented. The level of data generated within healthcare systems is not trivial. With the added adoption of mHealth, eHealth and wearable technologies the volume of data will continue to increase. This includes electronic health record data, imaging data, patient generated data, sensor data, and other forms of difficult to process data. There is now an even greater need for such environments to pay greater attention to data and information quality. \"Big data very often means 'dirty data' and the fraction of data inaccuracies increases with data volume growth.\" Human inspection at the big data scale is impossible and there is a desperate need in health service for intelligent tools for accuracy and believability control and handling of information missed. While extensive information in healthcare is now electronic, it fits under the big data umbrella as most is unstructured and difficult to use. The use of big data in healthcare has raised significant ethical challenges ranging from risks for individual rights, privacy and autonomy, to transparency and trust.Big data in health research is particularly promising in terms of exploratory biomedical research, as data-driven analysis can move forward more quickly than hypothesis-driven research. Then, trends seen in data analysis can be tested in traditional, hypothesis-driven followup biological research and eventually clinical research.\nA related application sub-area, that heavily relies on big data, within the healthcare field is that of computer-aided diagnosis in medicine.\n  One only needs to recall that, for instance, for epilepsy monitoring it is customary to create 5 to 10 GB of data daily. \n Similarly, a single uncompressed image of breast tomosynthesis averages 450 MB of data. \n\nThese are just few of the many examples where computer-aided diagnosis uses big data.  For this reason, big data has been recognized as one of the seven key challenges that computer-aided diagnosis systems need to overcome in order to reach the next level of performance. \n\n\n=== Education ===\nA McKinsey Global Institute study found a shortage of 1.5 million highly trained data professionals and managers and a number of universities including University of Tennessee and UC Berkeley, have created masters programs to meet this demand.  Private boot camps have also developed programs to meet that demand, including free programs like The Data Incubator or paid programs like General Assembly. In the specific field of marketing, one of the problems stressed by Wedel and Kannan  is that marketing has several sub domains (e.g., advertising, promotions,\nproduct development, branding) that all use different types of data. Because one-size-fits-all analytical solutions are not desirable, business schools should prepare marketing managers to have wide knowledge on all the different techniques used in these sub domains to get a big picture and work effectively with analysts.\n\n\n=== Media ===\nTo understand how the media utilizes big data, it is first necessary to provide some context into the mechanism used for media process. It has been suggested by Nick Couldry and Joseph Turow that practitioners in Media and Advertising approach big data as many actionable points of information about millions of individuals. The industry appears to be moving away from the traditional approach of using specific media environments such as newspapers, magazines, or television shows and instead taps into consumers with technologies that reach targeted people at optimal times in optimal locations.  The ultimate aim is to serve or convey, a message or content that is (statistically speaking) in line with the consumer's mindset. For example, publishing environments are increasingly tailoring messages (advertisements) and content (articles) to appeal to consumers that have been exclusively gleaned through various data-mining activities.\nTargeting of consumers (for advertising by marketers) \nData capture\nData journalism: publishers and journalists use big data tools to provide unique and innovative insights and infographics.Channel 4, the British public-service television broadcaster, is a leader in the field of big data and data analysis.\n\n\n=== Insurance ===\nHealth insurance providers are collecting data on social \"determinants of health\" such as food and TV consumption, marital status, clothing size and purchasing habits, from which they make predictions on health costs, in order to spot health issues in their clients. It is controversial whether these predictions are currently being used for pricing.\n\n\n=== Internet of Things (IoT) ===\n\nBig data and the IoT work in conjunction. Data extracted from IoT devices provides a mapping of device inter-connectivity. Such mappings have been used by the media industry, companies and governments to more accurately target their audience and increase media efficiency. IoT is also increasingly adopted as a means of gathering sensory data, and this sensory data has been used in medical, manufacturing  and transportation  contexts.\nKevin Ashton, digital innovation expert who is credited with coining the term, defines the Internet of Things in this quote: \u201cIf we had computers that knew everything there was to know about things\u2014using data they gathered without any help from us\u2014we would be able to track and count everything, and greatly reduce waste, loss, and cost. We would know when things needed replacing, repairing or recalling, and whether they were fresh or past their best.\u201d\n\n\n=== Information technology ===\nEspecially since 2015, big data has come to prominence within business operations as a tool to help employees work more efficiently and streamline the collection and distribution of information technology (IT). The use of big data to resolve IT and data collection issues within an enterprise is called IT operations analytics (ITOA). By applying big data principles into the concepts of machine intelligence and deep computing, IT departments can predict potential issues and move to provide solutions before the problems even happen. In this time, ITOA businesses were also beginning to play a major role in systems management by offering platforms that brought individual data silos together and generated insights from the whole of the system rather than from isolated pockets of data.\n\n\n== Case studies ==\n\n\n=== Government ===\n\n\n==== China ====\nThe Integrated Joint Operations Platform (IJOP, \u4e00\u4f53\u5316\u8054\u5408\u4f5c\u6218\u5e73\u53f0) is used by the government to monitor the population, particularly Uyghurs. Biometrics, including DNA samples, are gathered through a program of free physicals.\nBy 2020, China plans to give all its citizens a personal \"Social Credit\" score based on how they behave. The Social Credit System, now being piloted in a number of Chinese cities, is considered a form of mass surveillance which uses big data analysis technology.\n\n\n==== India ====\nBig data analysis was tried out for the BJP to win the Indian General Election 2014.\nThe Indian government utilizes numerous techniques to ascertain how the Indian electorate is responding to government action, as well as ideas for policy augmentation.\n\n\n==== Israel ====\nA big data application was designed by Agro Web Lab to aid irrigation regulation.\nPersonalized diabetic treatments can be created through GlucoMe's big data solution.\n\n\n==== United Kingdom ====\nExamples of uses of big data in public services:\n\nData on prescription drugs: by connecting origin, location and the time of each prescription, a research unit was able to exemplify the considerable delay between the release of any given drug, and a UK-wide adaptation of the National Institute for Health and Care Excellence guidelines. This suggests that new or most up-to-date drugs take some time to filter through to the general patient.\nJoining up data: a local authority blended data about services, such as road gritting rotas, with services for people at risk, such as 'meals on wheels'. The connection of data allowed the local authority to avoid any weather-related delay.\n\n\n==== United States of America ====\nIn 2012, the Obama administration announced the Big Data Research and Development Initiative, to explore how big data could be used to address important problems faced by the government. The initiative is composed of 84 different big data programs spread across six departments.\nBig data analysis played a large role in Barack Obama's successful 2012 re-election campaign.\nThe United States Federal Government owns five of the ten most powerful supercomputers in the world.\nThe Utah Data Center has been constructed by the United States National Security Agency. When finished, the facility will be able to handle a large amount of information collected by the NSA over the Internet. The exact amount of storage space is unknown, but more recent sources claim it will be on the order of a few exabytes. This has posed security concerns regarding the anonymity of the data collected.\n\n\n=== Retail ===\nWalmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data\u2014the equivalent of 167 times the information contained in all the books in the US Library of Congress.\nWindermere Real Estate uses location information from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.\nFICO Card Detection System protects accounts worldwide.\n\n\n=== Science ===\nThe Large Hadron Collider experiments represent about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and refraining from recording more than 99.99995% of these streams, there are 1,000 collisions of interest per second.As a result, only working with less than 0.001% of the sensor stream data, the data flow from all four LHC experiments represents 25 petabytes annual rate before replication (as of 2012). This becomes nearly 200 petabytes after replication.\nIf all sensor data were recorded in LHC, the data flow would be extremely hard to work with. The data flow would exceed 150 million petabytes annual rate, or nearly 500 exabytes per day, before replication. To put the number in perspective, this is equivalent to 500 quintillion (5\u00d71020) bytes per day, almost 200 times more than all the other sources combined in the world.\nThe Square Kilometre Array is a radio telescope built of thousands of antennas. It is expected to be operational by 2024. Collectively, these antennas are expected to gather 14 exabytes and store one petabyte per day. It is considered one of the most ambitious scientific projects ever undertaken.\nWhen the Sloan Digital Sky Survey (SDSS) began to collect astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy previously. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information. When the Large Synoptic Survey Telescope, successor to SDSS, comes online in 2020, its designers expect it to acquire that amount of data every five days.\nDecoding the human genome originally took 10 years to process; now it can be achieved in less than a day. The DNA sequencers have divided the sequencing cost by 10,000 in the last ten years, which is 100 times cheaper than the reduction in cost predicted by Moore's Law.\nThe NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.\nGoogle's DNAStack compiles and organizes DNA samples of genetic data from around the world to identify diseases and other medical defects. These fast and exact calculations eliminate any 'friction points,' or human errors that could be made by one of the numerous science and biology experts working with the DNA. DNAStack, a part of Google Genomics, allows scientists to use the vast sample of resources from Google's search server to scale social experiments that would usually take years, instantly.\n23andme's DNA database contains genetic information of over 1,000,000 people worldwide. The company explores selling the \"anonymous aggregated genetic data\" to other researchers and pharmaceutical companies for research purposes if patients give their consent. Ahmad Hariri, professor of psychology and neuroscience at Duke University who has been using 23andMe in his research since 2009 states that the most important aspect of the company's new service is that it makes genetic research accessible and relatively cheap for scientists. A study that identified 15 genome sites linked to depression in 23andMe's database lead to a surge in demands to access the repository with 23andMe fielding nearly 20 requests to access the depression data in the two weeks after publication of the paper.\nComputational Fluid Dynamics (CFD) and hydrodynamic turbulence research generate massive data sets. The Johns Hopkins Turbulence Databases (JHTDB) contains over 350 terabytes of spatiotemporal fields from Direct Numerical simulations of various turbulent flows. Such data have been difficult to share using traditional methods such as downloading flat simulation output files. The data within JHTDB can be accessed using \"virtual sensors\" with various access modes ranging from direct web-browser queries, access through Matlab, Python, Fortran and C programs executing on clients' platforms, to cut out services to download raw data.  The data have been used in over  150 scientific publications.\n\n\n=== Sports ===\nBig data can be used to improve training and understanding competitors, using sport sensors. It is also possible to predict winners in a match using big data analytics.\nFuture performance of players could be predicted as well. Thus, players' value and salary is determined by data collected throughout the season.In Formula One races, race cars with hundreds of sensors generate terabytes of data. These sensors collect data points from tire pressure to fuel burn efficiency.\nBased on the data, engineers and data analysts decide whether adjustments should be made in order to win a race. Besides, using big data, race teams try to predict the time they will finish the race beforehand, based on simulations using data collected over the season.\n\n\n=== Technology ===\neBay.com uses two data warehouses at 7.5 petabytes and 40PB as well as a 40PB Hadoop cluster for search, consumer recommendations, and merchandising.\nAmazon.com handles millions of back-end operations every day, as well as queries from more than half a million third-party sellers. The core technology that keeps Amazon running is Linux-based and as of 2005 they had the world's three largest Linux databases, with capacities of 7.8 TB, 18.5 TB, and 24.7 TB.\nFacebook handles 50 billion photos from its user base. As of June 2017, Facebook reached 2 billion monthly active users.\nGoogle was handling roughly 100 billion searches per month as of August 2012.\n\n\n== Research activities ==\nEncrypted search and cluster formation in big data was demonstrated in March 2014 at the American Society of Engineering Education. Gautam Siwach engaged at Tackling the challenges of Big Data by MIT Computer Science and Artificial Intelligence Laboratory and Dr. Amir Esmailpour at UNH Research Group investigated the key features of big data as the formation of clusters and their interconnections. They focused on the security of big data and the orientation of the term towards the presence of different types of data in an encrypted form at cloud interface by providing the raw definitions and real-time examples within the technology. Moreover, they proposed an approach for identifying the encoding technique to advance towards an expedited search over encrypted text leading to the security enhancements in big data.In March 2012, The White House announced a national \"Big Data Initiative\" that consisted of six Federal departments and agencies committing more than $200 million to big data research projects.The initiative included a National Science Foundation \"Expeditions in Computing\" grant of $10 million over 5 years to the AMPLab at the University of California, Berkeley. The AMPLab also received funds from DARPA, and over a dozen industrial sponsors and uses big data to attack a wide range of problems from predicting traffic congestion to fighting cancer.The White House Big Data Initiative also included a commitment by the  Department of Energy to provide $25 million in funding over 5 years to establish the scalable Data Management, Analysis and Visualization (SDAV) Institute, led by the Energy Department's Lawrence Berkeley National Laboratory. The SDAV Institute aims to bring together the expertise of six national laboratories and seven universities to develop new tools to help scientists manage and visualize data on the Department's supercomputers.\nThe U.S. state of Massachusetts announced the Massachusetts Big Data Initiative in May 2012, which provides funding from the state government and private companies to a variety of research institutions.  The Massachusetts Institute of Technology hosts the Intel Science and Technology Center for Big Data in the MIT Computer Science and Artificial Intelligence Laboratory, combining government, corporate, and institutional funding and research efforts.The European Commission is funding the 2-year-long Big Data Public Private Forum through their Seventh Framework Program to engage companies, academics and other stakeholders in discussing big data issues. The project aims to define a strategy in terms of research and innovation to guide supporting actions from the European Commission in the successful implementation of the big data economy. Outcomes of this project will be used as input for Horizon 2020, their next framework program.The British government announced in March 2014 the founding of the Alan Turing Institute, named after the computer pioneer and code-breaker, which will focus on new ways to collect and analyze large data sets.At the University of Waterloo Stratford Campus Canadian Open Data Experience (CODE) Inspiration Day, participants demonstrated how using data visualization can increase the understanding and appeal of big data sets and communicate their story to the world.The National Science Foundation has granted the industry-university cooperative research center for Intelligent Maintenance Systems (IMS) to focus on developing advanced predictive tools and techniques to be applicable in a big data environment. In May 2013, the IMS Center held an industry advisory board meeting focusing on big data where presenters from various industrial companies discussed their concerns, issues and future goals in the big data environment.\nComputational social sciences \u2013 Anyone can use Application Programming Interfaces (APIs) provided by big data holders, such as Google and Twitter, to do research in the social and behavioral sciences. Often these APIs are provided for free. Tobias Preis et al. used Google Trends data to demonstrate that Internet users from countries with a higher per capita gross domestic product (GDP) are more likely to search for information about the future than information about the past. The findings suggest there may be a link between online behaviour and real-world economic indicators. The authors of the study examined Google queries logs made by ratio of the volume of searches for the coming year ('2011') to the volume of searches for the previous year ('2009'), which they call the 'future orientation index'. They compared the future orientation index to the per capita GDP of each country, and found a strong tendency for countries where Google users inquire more about the future to have a higher GDP. The results hint that there may potentially be a relationship between the economic success of a country and the information-seeking behavior of its citizens captured in big data.\nTobias Preis and his colleagues Helen Susannah Moat and H. Eugene Stanley introduced a method to identify online precursors for stock market moves, using trading strategies based on search volume data provided by Google Trends. Their analysis of Google search volume for 98 terms of varying financial relevance, published in Scientific Reports, suggests that increases in search volume for financially relevant search terms tend to precede large losses in financial markets.Big data sets come with algorithmic challenges that previously did not exist. Hence, there is a need to fundamentally change the processing ways.The Workshops on Algorithms for Modern Massive Data Sets (MMDS) bring together computer scientists, statisticians, mathematicians, and data analysis practitioners to discuss algorithmic challenges of big data.  Regarding big data, one needs to keep in mind that such concepts of magnitude are relative.  As it is stated \"If the past is of any guidance, then today\u2019s big data most likely will not be considered as such in the near future.\" \n\n\n=== Sampling big data ===\nAn important research question that can be asked about big data sets is whether you need to look at the full data to draw certain conclusions about the properties of the data or is a sample good enough. The name big data itself contains a term related to size and this is an important characteristic of big data. But Sampling (statistics) enables the selection of right data points from within the larger data set to estimate the characteristics of the whole population. For example, there are about 600 million tweets produced every day. Is it necessary to look at all of them to determine the topics that are discussed during the day? Is it necessary to look at all the tweets to determine the sentiment on each of the topics? In manufacturing different types of sensory data such as acoustics, vibration, pressure, current, voltage and controller data are available at short time intervals. To predict downtime it may not be necessary to look at all the data but a sample may be sufficient.  Big Data can be broken down by various data point categories such as demographic, psychographic, behavioral, and transactional data.  With large sets of data points, marketers are able to create and utilize more customized segments of consumers for more strategic targeting.\nThere has been some work done in Sampling algorithms for big data. A theoretical formulation for sampling Twitter data has been developed.\n\n\n== Critique ==\nCritiques of the big data paradigm come in two flavors, those that question the implications of the approach itself, and those that question the way it is currently done. One approach to this criticism is the field of critical data studies.\n\n\n=== Critiques of the big data paradigm ===\n\"A crucial problem is that we do not know much about the underlying empirical micro-processes that lead to the emergence of the[se] typical network characteristics of Big Data\". In their critique, Snijders, Matzat, and Reips point out that often very strong assumptions are made about mathematical properties that may not at all reflect what is really going on at the level of micro-processes. Mark Graham has leveled broad critiques at Chris Anderson's assertion that big data will spell the end of theory: focusing in particular on the notion that big data must always be contextualized in their social, economic, and political contexts. Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, big data, no matter how comprehensive or well analyzed, must be complemented by \"big judgment,\" according to an article in the Harvard Business Review.Much in the same line, it has been pointed out that the decisions based on the analysis of big data are inevitably \"informed by the world as it was in the past, or, at best, as it currently is\".  Fed by a large number of data on past experiences, algorithms can predict future development if the future is similar to the past. If the system's dynamics of the future change (if it is not a stationary process), the past can say little about the future. In order to make predictions in changing environments, it would be necessary to have a thorough understanding of the systems dynamic, which requires theory.  As a response to this critique Alemany Oliver and Vayre suggest to use \"abductive reasoning as a first step in the research process in order to bring context to consumers' digital traces and make new theories emerge\".\nAdditionally, it has been suggested to combine big data approaches with computer simulations, such as agent-based models and complex systems. Agent-based models are increasingly getting better in predicting the outcome of social complexities of even unknown future scenarios through computer simulations that are based on a collection of mutually interdependent algorithms. Finally, the use of multivariate methods that probe for the latent structure of the data, such as factor analysis and cluster analysis, have proven useful as analytic approaches that go well beyond the bi-variate approaches (cross-tabs) typically employed with smaller data sets.\nIn health and biology, conventional scientific approaches are based on experimentation. For these approaches, the limiting factor is the relevant data that can confirm or refute the initial hypothesis.\nA new postulate is accepted now in biosciences: the information provided by the data in huge volumes (omics) without prior hypothesis is complementary and sometimes necessary to conventional approaches based on experimentation. In the massive approaches it is the formulation of a relevant hypothesis to explain the data that is the limiting factor. The search logic is reversed and the limits of induction (\"Glory of Science and Philosophy scandal\", C. D. Broad, 1926) are to be considered.Privacy advocates are concerned about the threat to privacy represented by increasing storage and integration of personally identifiable information; expert panels have released various policy recommendations to conform practice to expectations of privacy. The misuse of Big Data in several cases by media, companies and even the government has allowed for abolition of trust in almost every fundamental institution holding up society.Nayef Al-Rodhan argues that a new kind of social contract will be needed to protect individual liberties in a context of Big Data and giant corporations that own vast amounts of information. The use of Big Data should be monitored and better regulated at the national and international levels. Barocas and Nissenbaum argue that one way of protecting individual users is by being informed about the types of information being collected, with whom it is shared, under what constrains and for what purposes.\n\n\n=== Critiques of the 'V' model ===\nThe 'V' model of Big Data is concerting as it centres around computational scalability and lacks in a loss around the perceptibility and understandability of information. This led to the framework of cognitive big data, which characterizes Big Data application according to:\nData completeness: understanding of the non-obvious from data;\nData correlation, causation, and predictability: causality as not essential requirement to achieve predictability;\nExplainability and interpretability: humans desire to understand and accept what they understand, where algorithms don't cope with this;\nLevel of automated decision making: algorithms that support automated decision making and algorithmic self-learning;\n\n\n=== Critiques of novelty ===\nLarge data sets have been analyzed by computing machines for well over a century, including the US census analytics performed by IBM's punch card machines which computed statistics including means and variances of populations across the whole continent.   In more recent decades, science experiments such as CERN have produced data on similar scales to current commercial \"big data\".  However science experiments have tended to analyze their data using specialized custom-built high performance computing (super-computing) clusters and grids, rather than clouds of cheap commodity computers as in the current commercial wave, implying a difference in both culture and technology stack.\n\n\n=== Critiques of big data execution ===\nUlf-Dietrich Reips and Uwe Matzat wrote in 2014 that big data had become a \"fad\" in scientific research. Researcher Danah Boyd has raised concerns about the use of big data in science neglecting principles such as choosing a representative sample by being too concerned about handling the huge amounts of data. This approach may lead to results bias in one way or another. Integration across heterogeneous data resources\u2014some that might be considered big data and others not\u2014presents formidable logistical as well as analytical challenges, but many researchers argue that such integrations are likely to represent the most promising new frontiers in science.\nIn the provocative article \"Critical Questions for Big Data\", the authors title big data a part of mythology: \"large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy\". Users of big data are often \"lost in the sheer volume of numbers\", and \"working with Big Data is still subjective, and what it quantifies does not necessarily have a closer claim on objective truth\". Recent developments in BI domain, such as pro-active reporting especially target improvements in usability of big data, through automated filtering of non-useful data and correlations. Big structures are full of spurious correlations either because of non-causal coincidences (law of truly large numbers), solely nature of big randomness (Ramsey theory) or existence of non included factors so the hope, of early experimenters to make large databases of numbers \"speak for themselves\" and revolutionize scientific method, is questioned.Big data analysis is often shallow compared to analysis of smaller data sets. In many big data projects, there is no large data analysis happening, but the challenge is the extract, transform, load part of data pre-processing.Big data is a buzzword and a \"vague term\", but at the same time an \"obsession\" with entrepreneurs, consultants, scientists and the media. Big data showcases such as Google Flu Trends failed to deliver good predictions in recent years, overstating the flu outbreaks by a factor of two. Similarly, Academy awards and election predictions solely based on Twitter were more often off than on target.\nBig data often poses the same challenges as small data; adding more data does not solve problems of bias, but may emphasize other problems. In particular data sources such as Twitter are not representative of the overall population, and results drawn from such sources may then lead to wrong conclusions. Google Translate\u2014which is based on big data statistical analysis of text\u2014does a good job at translating web pages. However, results from specialized domains may be dramatically skewed.\nOn the other hand, big data may also introduce new problems, such as the multiple comparisons problem: simultaneously testing a large set of hypotheses is likely to produce many false results that mistakenly appear significant.\nIoannidis argued that \"most published research findings are false\" due to essentially the same effect: when many scientific teams and researchers each perform many experiments (i.e. process a big amount of scientific data; although not with big data technology), the likelihood of a \"significant\" result being false grows fast \u2013 even more so, when only positive results are published.\nFurthermore, big data analytics results are only as good as the model on which they are predicated.  In an example, big data took part in attempting to predict the results of the 2016 U.S. Presidential Election with varying degrees of success.\n\n\n=== Critiques of big data policing and surveillance ===\nBig Data has been used in policing and surveillance by institutions like law enforcement and corporations. Due to the less visible nature of data-based surveillance as compared to traditional method of policing, objections to big data policing are less likely to arise. According to Sarah Brayne\u2019s Big Data Surveillance: The Case of Policing, big data policing can reproduce existing societal inequalities in three ways:\n\nPlacing suspected criminals under increased surveillance by using the justification of a mathematical and therefore unbiased algorithm;\nIncreasing the scope and number of people that are subject to law enforcement tracking and exacerbating existing racial overrepresentation in the criminal justice system;\nEncouraging members of society to abandon interactions with institutions that would create a digital trace, thus creating obstacles to social inclusion.If these potential problems are not corrected or regulating, the effects of big data policing continue to shape societal hierarchies. Conscientious usage of big data policing could prevent individual level biases from becoming institutional biases, Brayne also notes.\n\n\n== See also ==\n\n\n== References ==\n\n\n== Further reading ==\nPeter Kinnaird; Inbal Talgam-Cohen, eds. (2012). \"Big Data\". ACM Crossroads student magazine. XRDS: Crossroads, The ACM Magazine for Students. Vol. 19 no. 1. Association for Computing Machinery. ISSN 1528-4980. OCLC 779657714.\nJure Leskovec; Anand Rajaraman; Jeffrey D. Ullman (2014). Mining of massive datasets. Cambridge University Press. ISBN 9781107077232. OCLC 888463433.\nViktor Mayer-Sch\u00f6nberger; Kenneth Cukier (2013). Big Data: A Revolution that Will Transform how We Live, Work, and Think. Houghton Mifflin Harcourt. ISBN 9781299903029. OCLC 828620988.\nPress, Gil (9 May 2013). \"A Very Short History Of Big Data\". forbes.com. Jersey City, NJ: Forbes Magazine. Retrieved 17 September 2016.\n\"Big Data: The Management Revolution\". hbr.org. Harvard Business Review.\nO'Neil, Cathy (2017). Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books. ISBN 978-0553418835.\n\n\n== External links ==\n Media related to Big data at Wikimedia Commons\n The dictionary definition of big data at Wiktionary"}, {"job": "data scientist", "skill": "python language", "keywords": "python language", "description": "Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects.Python is dynamically typed and garbage-collected. It supports multiple programming paradigms, including procedural, object-oriented, and  functional programming. Python is often described as a \"batteries included\" language due to its comprehensive standard library.Python was conceived in the late 1980s as a successor to the ABC language. Python 2.0, released in 2000, introduced features like list comprehensions and a garbage collection system capable of collecting reference cycles. Python 3.0, released in 2008, was a major revision of the language that is not completely backward-compatible, and much Python 2 code does not run unmodified on Python 3.\nThe Python 2 language, i.e. Python 2.7.x, is \"sunsetting\" in less than a month on January 1, 2020 (after extension; first planned for 2015), and the Python team of volunteers will not fix security issues, or improve it in other ways after that date. With the end-of-life, only  Python 3.5.x and later will be supported.\nPython interpreters are available for many operating systems. A global community of programmers develops and maintains CPython, an open source reference implementation. A non-profit organization, the Python Software Foundation, manages and directs resources for Python and CPython development.\n\n\n== History ==\n\nPython was conceived in the late 1980s by Guido van Rossum at Centrum Wiskunde & Informatica (CWI) in the Netherlands as a successor to the ABC language (itself inspired by SETL), capable of exception handling and interfacing with the Amoeba operating system. Its implementation began in December 1989. Van Rossum shouldered sole responsibility for the project, as the lead developer, until July 12, 2018, when he announced his \"permanent vacation\" from his responsibilities as Python's Benevolent Dictator For Life, a title the Python community bestowed upon him to reflect his long-term commitment as the project's chief decision-maker. He now shares his leadership as a member of a five-person steering council. In January, 2019, active Python core developers elected Brett Cannon, Nick Coghlan, Barry Warsaw, Carol Willing and Van Rossum to a five-member \"Steering Council\" to lead the project.Python 2.0 was released on 16 October 2000 with many major new features, including a cycle-detecting garbage collector and support for Unicode.Python 3.0 was released on 3 December 2008. It was a major revision of the language that is not completely backward-compatible. Many of its major features were backported to Python 2.6.x and 2.7.x version series.  Releases of Python 3 include the 2to3 utility, which automates (at least partially) the translation of Python 2 code to Python 3.Python 2.7's end-of-life date was initially set at 2015 then postponed to 2020 out of concern that a large body of existing code could not easily be forward-ported to Python 3.\n\n\n== Features and philosophy ==\nPython is a multi-paradigm programming language. Object-oriented programming and structured programming are fully supported, and many of its features support functional programming and aspect-oriented programming (including by metaprogramming and metaobjects (magic methods)). Many other paradigms are supported via extensions, including design by contract and logic programming.Python uses dynamic typing and a combination of reference counting and a cycle-detecting garbage collector for memory management. It also features dynamic name resolution (late binding), which binds method and variable names during program execution.\nPython's design offers some support for functional programming in the Lisp tradition. It has filter, map, and reduce functions; list comprehensions, dictionaries, sets, and generator expressions. The standard library has two modules (itertools and functools) that implement functional tools borrowed from Haskell and Standard ML.The language's core philosophy is summarized in the document The Zen of Python (PEP 20), which includes aphorisms such as:\nBeautiful is better than ugly.\nExplicit is better than implicit.\nSimple is better than complex.\nComplex is better than complicated.\nReadability counts.Rather than having all of its functionality built into its core, Python was designed to be highly extensible. This compact modularity has made it particularly popular as a means of adding programmable interfaces to existing applications. Van Rossum's vision of a small core language with a large standard library and easily extensible interpreter stemmed from his frustrations with ABC, which espoused the opposite approach.Python strives for a simpler, less-cluttered syntax and grammar while giving developers a choice in their coding methodology. In contrast to Perl's \"there is more than one way to do it\" motto, Python embraces a \"there should be one\u2014and preferably only one\u2014obvious way to do it\" design philosophy. Alex Martelli, a Fellow at the Python Software Foundation and Python book author, writes that \"To describe something as 'clever' is not considered a compliment in the Python culture.\"Python's developers strive to avoid premature optimization, and reject patches to non-critical parts of the CPython reference implementation that would offer marginal increases in speed at the cost of clarity. When speed is important, a Python programmer can move time-critical functions to extension modules written in languages such as C, or use PyPy, a just-in-time compiler. Cython is also available, which translates a Python script into C and makes direct C-level API calls into the Python interpreter.\nAn important goal of Python's developers is keeping it fun to use. This is reflected in the language's name\u2014a tribute to the British comedy group Monty Python\u2014and in occasionally playful approaches to tutorials and reference materials, such as examples that refer to spam and eggs (from a famous Monty Python sketch) instead of the standard foo and bar.A common neologism in the Python community is pythonic, which can have a wide range of meanings related to program style. To say that code is pythonic is to say that it uses Python idioms well, that it is natural or shows fluency in the language, that it conforms with Python's minimalist philosophy and emphasis on readability. In contrast, code that is difficult to understand or reads like a rough transcription from another programming language is called unpythonic.\nUsers and admirers of Python, especially those considered knowledgeable or experienced, are often referred to as Pythonistas.\n\n\n== Syntax and semantics ==\n\nPython is meant to be an easily readable language. Its formatting is visually uncluttered, and it often uses English keywords where other languages use punctuation. Unlike many other languages, it does not use curly brackets to delimit blocks, and semicolons after statements are optional. It has fewer syntactic exceptions and special cases than C or Pascal.\n\n\n=== Indentation ===\n\nPython uses whitespace indentation, rather than curly brackets or keywords, to delimit blocks. An increase in indentation comes after certain statements; a decrease in indentation signifies the end of the current block. Thus, the program's visual structure accurately represents the program's semantic structure. This feature is sometimes termed the off-side rule, which some other languages share, but in most languages indentation doesn't have any semantic meaning.\n\n\n=== Statements and control flow ===\nPython's statements include (among others):\n\nThe assignment statement (token '=', the equals sign). This operates differently than in traditional imperative programming languages, and this fundamental mechanism (including the nature of Python's version of variables) illuminates many other features of the language. Assignment in C, e.g., x = 2, translates to \"typed variable name x receives a copy of numeric value 2\". The (right-hand) value is copied into an allocated storage location for which the (left-hand) variable name is the symbolic address. The memory allocated to the variable is large enough (potentially quite large) for the declared type. In the simplest case of Python assignment, using the same example, x = 2, translates to \"(generic) name x receives a reference to a separate, dynamically allocated object of numeric (int) type of value 2.\" This is termed binding the name to the object. Since the name's storage location doesn't contain the indicated value, it is improper to call it a variable. Names may be subsequently rebound at any time to objects of greatly varying types, including strings, procedures, complex objects with data and methods, etc. Successive assignments of a common value to multiple names, e.g., x = 2; y = 2; z = 2 result in allocating storage to (at most) three names and one numeric object, to which all three names are bound. Since a name is a generic reference holder it is unreasonable to associate a fixed data type with it. However at a given time a name will be bound to some object, which will have a type; thus there is dynamic typing.\nThe if statement, which conditionally executes a block of code, along with else and elif (a contraction of else-if).\nThe for statement, which iterates over an iterable object, capturing each element to a local variable for use by the attached block.\nThe while statement, which executes a block of code as long as its condition is true.\nThe try statement, which allows exceptions raised in its attached code block to be caught and handled by except clauses; it also ensures that clean-up code in a finally block will always be run regardless of how the block exits.\nThe raise statement, used to raise a specified exception or re-raise a caught exception.\nThe class statement, which executes a block of code and attaches its local namespace to a class, for use in object-oriented programming.\nThe def statement, which defines a function or method.\nThe with statement, from Python 2.5 released on September 2006, which encloses a code block within a context manager (for example, acquiring a lock before the block of code is run and releasing the lock afterwards, or opening a file and then closing it), allowing Resource Acquisition Is Initialization (RAII)-like behavior and replaces a common try/finally idiom.\nThe break statement, exits from the loop.\nThe continue statement, skips this iteration and continues with the next item.\nThe pass statement, which serves as a NOP. It is syntactically needed to create an empty code block.\nThe assert statement, used during debugging to check for conditions that ought to apply.\nThe yield statement, which returns a value from a generator function. From Python 2.5, yield is also an operator. This form is used to implement coroutines.\nThe import statement, which is used to import modules whose functions or variables can be used in the current program. There are three ways of using import: import <module name> [as <alias>] or from <module name> import * or from <module name> import <definition 1> [as <alias 1>], <definition 2> [as <alias 2>], ....\nThe print statement was changed to the print() function in Python 3.Python does not support tail call optimization or first-class continuations, and, according to Guido van Rossum, it never will. However, better support for coroutine-like functionality is provided in 2.5, by extending Python's generators. Before 2.5, generators were lazy iterators; information was passed unidirectionally out of the generator. From Python 2.5, it is possible to pass information back into a generator function, and from Python 3.3, the information can be passed through multiple stack levels.\n\n\n=== Expressions ===\nSome Python expressions are similar to languages such as C and Java, while some are not:\n\nAddition, subtraction, and multiplication are the same, but the behavior of division differs. There are two types of divisions in Python. They are floor division (or integer division) // and floating point/division. Python also added the ** operator for exponentiation.\nFrom Python 3.5, the new @ infix operator was introduced. It is intended to be used by libraries such as NumPy for matrix multiplication.\nFrom Python 3.8, the syntax :=, called the 'walrus operator' was introduced. It assigns values to variables as part of a larger expression.\nIn Python, == compares by value, versus Java, which compares numerics by value and objects by reference. (Value comparisons in Java on objects can be performed with the equals() method.) Python's is operator may be used to compare object identities (comparison by reference). In Python, comparisons may be chained, for example a <= b <= c.\nPython uses the words and, or, not for its boolean operators rather than the symbolic &&, ||, ! used in Java and C.\nPython has a type of expression termed a list comprehension. Python 2.4 extended list comprehensions into a more general expression termed a generator expression.\nAnonymous functions are implemented using lambda expressions; however, these are limited in that the body can only be one expression.\nConditional expressions in Python are written as x if c else y (different in order of operands from the c ? x : y operator common to many other languages).\nPython makes a distinction between lists and tuples. Lists are written as [1, 2, 3], are mutable, and cannot be used as the keys of dictionaries (dictionary keys must be immutable in Python). Tuples are written as (1, 2, 3), are immutable and thus can be used as the keys of dictionaries, provided all elements of the tuple are immutable. The + operator can be used to concatenate two tuples, which does not directly modify their contents, but rather produces a new tuple containing the elements of both provided tuples. Thus, given the variable t initially equal to (1, 2, 3), executing t = t + (4, 5) first evaluates t + (4, 5), which yields (1, 2, 3, 4, 5), which is then assigned back to t, thereby effectively \"modifying the contents\" of t, while conforming to the immutable nature of tuple objects. Parentheses are optional for tuples in unambiguous contexts.\nPython features sequence unpacking wherein multiple expressions, each evaluating to anything that can be assigned to (a variable, a writable property, etc.), are associated in the identical manner to that forming tuple literals and, as a whole, are put on the left hand side of the equal sign in an assignment statement. The statement expects an iterable object on the right hand side of the equal sign that produces the same number of values as the provided writable expressions when iterated through, and will iterate through it, assigning each of the produced values to the corresponding expression on the left.\nPython has a \"string format\" operator %. This functions analogous to printf format strings in C, e.g. \"spam=%s eggs=%d\" % (\"blah\", 2) evaluates to \"spam=blah eggs=2\". In Python 3 and 2.6+, this was supplemented by the format() method of the str class, e.g. \"spam={0} eggs={1}\".format(\"blah\", 2). Python 3.6 added \"f-strings\": blah = \"blah\"; eggs = 2; f'spam={blah} eggs={eggs}'.\nPython has various kinds of string literals:\nStrings delimited by single or double quote marks. Unlike in Unix shells, Perl and Perl-influenced languages, single quote marks and double quote marks function identically. Both kinds of string use the backslash (\\) as an escape character. String interpolation became available in Python 3.6 as \"formatted string literals\".\nTriple-quoted strings, which begin and end with a series of three single or double quote marks. They may span multiple lines and function like here documents in shells, Perl and Ruby.\nRaw string varieties, denoted by prefixing the string literal with an r. Escape sequences are not interpreted; hence raw strings are useful where literal backslashes are common, such as regular expressions and Windows-style paths. Compare \"@-quoting\" in C#.\nPython has array index and array slicing expressions on lists, denoted as a[key], a[start:stop] or a[start:stop:step]. Indexes are zero-based, and negative indexes are relative to the end. Slices take elements from the start index up to, but not including, the stop index. The third slice parameter, called step or stride, allows elements to be skipped and reversed. Slice indexes may be omitted, for example a[:] returns a copy of the entire list. Each element of a slice is a shallow copy.In Python, a distinction between expressions and statements is rigidly enforced, in contrast to languages such as Common Lisp, Scheme, or Ruby. This leads to duplicating some functionality. For example:\n\nList comprehensions vs. for-loops\nConditional expressions vs. if blocks\nThe eval() vs. exec() built-in functions (in Python 2, exec is a statement); the former is for expressions, the latter is for statements.Statements cannot be a part of an expression, so list and other comprehensions or lambda expressions, all being expressions, cannot contain statements. A particular case of this is that an assignment statement such as a = 1 cannot form part of the conditional expression of a conditional statement. This has the advantage of avoiding a classic C error of mistaking an assignment operator = for an equality operator == in conditions: if (c = 1) { ... } is syntactically valid (but probably unintended) C code but if c = 1: ... causes a syntax error in Python.\n\n\n=== Methods ===\nMethods on objects are functions attached to the object's class; the syntax instance.method(argument) is, for normal methods and functions, syntactic sugar for Class.method(instance, argument). Python methods have an explicit self parameter to access instance data, in contrast to the implicit self (or this) in some other object-oriented programming languages (e.g., C++, Java, Objective-C, or Ruby).\n\n\n=== Typing ===\n\nPython uses duck typing and has typed objects but untyped variable names. Type constraints are not checked at compile time; rather, operations on an object may fail, signifying that the given object is not of a suitable type. Despite being dynamically typed, Python is strongly typed, forbidding operations that are not well-defined (for example, adding a number to a string) rather than silently attempting to make sense of them.\nPython allows programmers to define their own types using classes, which are most often used for object-oriented programming. New instances of classes are constructed by calling the class (for example, SpamClass() or EggsClass()), and the classes are instances of the metaclass type (itself an instance of itself), allowing metaprogramming and reflection.\nBefore version 3.0, Python had two kinds of classes: old-style and new-style. The syntax of both styles is the same, the difference being whether the class object is inherited from, directly or indirectly (all new-style classes inherit from object and are instances of type). In versions of Python 2 from Python 2.2 onwards, both kinds of classes can be used. Old-style classes were eliminated in Python 3.0.\nThe long term plan is to support gradual typing and from Python 3.5, the syntax of the language allows specifying static types but they are not checked in the default implementation, CPython. An experimental optional static type checker named mypy supports compile-time type checking.\n^a Not directly accessible by name \n\n\n=== Mathematics ===\nPython has the usual symbols for arithmetic operators (+, -, *, /), and the remainder operator % (where the remainder can be negative,  e.g. 4%-3 == -2). It also has ** for exponentiation, e.g. 5**3 == 125 and 9**0.5 == 3.0, and a new matrix multiply @ operator is included in version 3.5. I.e. all these operators work as in traditional math; with same precedence rules, the operators infix (or - can additionally be unary). Additionally, it has a unary operator (~), which essentially inverts all the bits of its one argument. For integers, this means ~x=-x-1. Other operators include bitwise shift operators x << y, which shifts x to the left y places, the same as x*(2**y) , and x >> y, which shifts x to the right y places, the same as x//(2**y).The behavior of division has changed significantly over time so that division between integers produces accurate floating point results:\nPython 2.1 and earlier use the C division behavior. The / operator is integer division if both operands are integers, and floating-point division otherwise. Integer division rounds towards 0, e.g. 7/3 == 2 and -7/3 == -2.\nPython 2.2 changes integer division to round towards negative infinity, e.g. 7/3 == 2 and -7/3 == -3. The floor division // operator is introduced. So 7//3 == 2, -7//3 == -3, 7.5//3 == 2.0 and -7.5//3 == -3.0. Adding from __future__ import division causes a module to use Python 3.0 rules for division (see next).\nPython 3.0 changes / to always be floating-point division, e.g. 5/2 == 2.5.In Python terms, / before version 3.0 is classic division, / in versions 3.0 and higher is true division, and // is floor division.Rounding towards negative infinity, though different from most languages, adds consistency. For instance, it means that the equation (a + b)//b == a//b + 1 is always true. It also means that the equation b*(a//b) + a%b == a is valid for both positive and negative values of a. However, maintaining the validity of this equation means that while the result of a%b is, as expected, in the half-open interval [0, b), where b is a positive integer, it has to lie in the interval (b, 0] when b is negative.Python provides a round function for rounding a float to the nearest integer. For tie-breaking, versions before 3 use round-away-from-zero: round(0.5) is 1.0, round(-0.5) is \u22121.0. Python 3 uses round to even: round(1.5) is 2, round(2.5) is 2.Python allows boolean expressions with multiple equality relations in a manner that is consistent with general use in mathematics. For example, the expression a < b < c tests whether a is less than b and b is less than c.  C-derived languages interpret this expression differently: in C, the expression would first evaluate a < b, resulting in 0 or 1, and that result would then be compared with c.Python has extensive built-in support for arbitrary-precision arithmetic. Integers are transparently switched from the machine-supported maximum fixed-precision (usually 32 or 64 bits), belonging to the python type int, to arbitrary precision, belonging to the Python type long, where needed. The latter have an \"L\" suffix in their textual representation. (In Python 3, the distinction between the int and long types was eliminated; this behavior is now entirely contained by the int class.) The Decimal type/class in module decimal (since version 2.4) provides decimal floating point numbers to arbitrary precision and several rounding modes. The Fraction type in module fractions (since version 2.6) provides arbitrary precision for rational numbers.Due to Python's extensive mathematics library, and the third-party library NumPy that further extends the native capabilities, it is frequently used as a scientific scripting language to aid in problems such as numerical data processing and manipulation.\n\n\n== Python programming examples ==\nHello world program:\n\nProgram to calculate factorial of a positive integer:\n\n\n== Libraries ==\nPython's large standard library, commonly cited as one of its greatest strengths, provides tools suited to many tasks. For Internet-facing applications, many standard formats and protocols such as MIME and HTTP are supported. It includes modules for creating graphical user interfaces, connecting to relational databases, generating pseudorandom numbers, arithmetic with arbitrary-precision decimals, manipulating regular expressions, and unit testing.\nSome parts of the standard library are covered by specifications (for example, the Web Server Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most modules are not. They are specified by their code, internal documentation, and test suites (if supplied). However, because most of the standard library is cross-platform Python code, only a few modules need altering or rewriting for variant implementations.\nAs of November 2019, the Python Package Index (PyPI), the official repository for third-party Python software, contains over 200,000 packages with a wide range of functionality, including:\n\nGraphical user interfaces\nWeb frameworks\nMultimedia\nDatabases\nNetworking\nTest frameworks\nAutomation\nWeb scraping\nDocumentation\nSystem administration\nScientific computing\nText processing\nImage processing\nMachine learning\nData analytics\n\n\n== Development environments ==\n\nMost Python implementations (including CPython) include a read\u2013eval\u2013print loop (REPL), permitting them to function as a command line interpreter for which the user enters statements sequentially and receives results immediately.\nOther shells, including IDLE and IPython, add further abilities such as auto-completion, session state retention and syntax highlighting.\nAs well as standard desktop integrated development environments, there are Web browser-based IDEs; SageMath (intended for developing science and math-related Python programs); PythonAnywhere, a browser-based IDE and hosting environment; and Canopy IDE, a commercial Python IDE emphasizing scientific computing.\n\n\n== Implementations ==\n\n\n=== Reference implementation ===\nCPython is the reference implementation of Python. It is written in C, meeting the C89 standard with several select C99 features. It compiles Python programs into an intermediate bytecode which is then executed by its virtual machine. CPython is distributed with a large standard library written in a mixture of C and native Python. It is available for many platforms, including Windows and most modern Unix-like systems. Platform portability was one of its earliest priorities.\n\n\n=== Other implementations ===\nPyPy is a fast, compliant interpreter of Python 2.7 and 3.5. Its just-in-time compiler brings a significant speed improvement over CPython.Stackless Python is a significant fork of CPython that implements microthreads; it does not use the C memory stack, thus allowing massively concurrent programs. PyPy also has a stackless version.MicroPython and CircuitPython are Python 3 variants optimized for microcontrollers. This includes Lego Mindstorms EV3.RustPython is a Python 3 interpreter written in Rust.\n\n\n=== Unsupported implementations ===\nOther just-in-time Python compilers have been developed, but are now unsupported:\n\nGoogle began a project named Unladen Swallow in 2009, with the aim of speeding up the Python interpreter five-fold by using the LLVM, and of improving its multithreading ability to scale to thousands of cores, while ordinary implementations suffer from the global interpreter lock.\nPsyco is a just-in-time specialising compiler that integrates with CPython and transforms bytecode to machine code at runtime. The emitted code is specialized for certain data types and is faster than standard Python code.In 2005, Nokia released a Python interpreter for the Series 60 mobile phones named PyS60. It includes many of the modules from the CPython implementations and some additional modules to integrate with the Symbian operating system. The project has been kept up-to-date to run on all variants of the S60 platform, and several third-party modules are available. The Nokia N900 also supports Python with GTK widget libraries, enabling programs to be written and run on the target device.\n\n\n=== Cross-compilers to other languages ===\nThere are several compilers to high-level object languages, with either unrestricted Python, a restricted subset of Python, or a language similar to Python as the source language:\n\nJython compiles into Java byte code, which can then be executed by every Java virtual machine implementation. This also enables the use of Java class library functions from the Python program.\nIronPython follows a similar approach in order to run Python programs on the .NET Common Language Runtime.\nThe RPython language can be compiled to C, Java bytecode, or Common Intermediate Language, and is used to build the PyPy interpreter of Python.\nPyjs compiles Python to JavaScript.\nCython compiles Python to C and C++.\nNumba uses LLVM to compile Python to machine code.\nPythran compiles Python to C++.\nSomewhat dated Pyrex (latest release in 2010) and Shed Skin (latest release in 2013) compile to C and C++ respectively.\nGoogle's Grumpy compiles Python to Go.\nMyHDL compiles Python to VHDL.\nNuitka compiles Python into C++.\n\n\n=== Performance ===\nA performance comparison of various Python implementations on a non-numerical (combinatorial) workload was presented at EuroSciPy '13.\n\n\n== Development ==\nPython's development is conducted largely through the Python Enhancement Proposal (PEP) process, the primary mechanism for proposing major new features, collecting community input on issues and documenting Python design decisions. Python coding style is covered in PEP 8. Outstanding PEPs are reviewed and commented on by the Python community and the steering council.Enhancement of the language corresponds with development of the CPython reference implementation. The mailing list python-dev is the primary forum for the language's development. Specific issues are discussed in the Roundup bug tracker maintained at python.org. Development originally took place on a self-hosted source-code repository running Mercurial, until Python moved to GitHub in January 2017.CPython's public releases come in three types, distinguished by which part of the version number is incremented:\n\nBackward-incompatible versions, where code is expected to break and need to be manually ported. The first part of the version number is incremented. These releases happen infrequently\u2014for example, version 3.0 was released 8 years after 2.0.\nMajor or \"feature\" releases, about every 18 months, are largely compatible but introduce new features. The second part of the version number is incremented. Each major version is supported by bugfixes for several years after its release.\nBugfix releases, which introduce no new features, occur about every 3 months and are made when a sufficient number of bugs have been fixed upstream since the last release. Security vulnerabilities are also patched in these releases. The third and final part of the version number is incremented.Python 3.9 alpha1 was announced in November 2019, but the release date for the final version depends on what new proposal for release dates are adopted with three draft proposals under discussion, and a yearly cadence is one option.Many alpha, beta, and release-candidates are also released as previews and for testing before final releases. Although there is a rough schedule for each release, they are often delayed if the code is not ready. Python's development team monitors the state of the code by running the large unit test suite during development, and using the BuildBot continuous integration system.The community of Python developers has also contributed over 206,000 software modules (as of 29 November 2019) to the Python Package Index (PyPI), the official repository of third-party Python libraries.\nThe major academic conference on Python is PyCon. There are also special Python mentoring programmes, such as Pyladies.\n\n\n== Naming ==\nPython's name is derived from the British comedy group Monty Python, whom Python creator Guido van Rossum enjoyed while developing the language. Monty Python references appear frequently in Python code and culture; for example, the metasyntactic variables often used in Python literature are spam and eggs instead of the traditional foo and bar. The official Python documentation also contains various references to Monty Python routines.The prefix Py- is used to show that something is related to Python. Examples of the use of this prefix in names of Python applications or libraries include Pygame, a binding of SDL to Python (commonly used to create games); PyQt and PyGTK, which bind Qt and GTK to Python respectively; and PyPy, a Python implementation originally written in Python.\n\n\n== API documentation generators ==\nPython API documentation generators include:\n\nSphinx\nEpydoc\nHeaderDoc\npydoc\n\n\n== Uses ==\n\nSince 2003, Python has consistently ranked in the top ten most popular programming languages in the TIOBE Programming Community Index where, as of December 2018, it is the third most popular language (behind Java, and C). It was selected Programming Language of the Year in 2007, 2010, and 2018.An empirical study found that scripting languages, such as Python, are more productive than conventional languages, such as C and Java, for programming problems involving string manipulation and search in a dictionary, and determined that memory consumption was often \"better than Java and not much worse than C or C++\".Large organizations that use Python include Wikipedia, Google, Yahoo!, CERN, NASA, Facebook, Amazon, Instagram, Spotify and some smaller entities like ILM and ITA. The social news networking site Reddit is written entirely in Python.Python can serve as a scripting language for web applications, e.g., via mod_wsgi for the Apache web server. With Web Server Gateway Interface, a standard API has evolved to facilitate these applications. Web frameworks like Django, Pylons, Pyramid, TurboGears, web2py, Tornado, Flask, Bottle and Zope support developers in the design and maintenance of complex applications. Pyjs and IronPython can be used to develop the client-side of Ajax-based applications. SQLAlchemy can be used as data mapper to a relational database. Twisted is a framework to program communications between computers, and is used (for example) by Dropbox.\nLibraries such as NumPy, SciPy and Matplotlib allow the effective use of Python in scientific computing, with specialized libraries such as Biopython and Astropy providing domain-specific functionality. SageMath is a mathematical software with a notebook interface programmable in Python: its library covers many aspects of mathematics, including algebra, combinatorics, numerical mathematics, number theory, and calculus.\nPython has been successfully embedded in many software products as a scripting language, including in finite element method software such as Abaqus, 3D parametric modeler like FreeCAD, 3D animation packages such as 3ds Max, Blender, Cinema 4D, Lightwave, Houdini, Maya, modo, MotionBuilder, Softimage, the visual effects compositor Nuke, 2D imaging programs like GIMP, Inkscape, Scribus and Paint Shop Pro, and musical notation programs like scorewriter and capella. GNU Debugger uses Python as a pretty printer to show complex structures such as C++ containers. Esri promotes Python as the best choice for writing scripts in ArcGIS. It has also been used in several video games, and has been adopted as first of the three available programming languages in Google App Engine, the other two being Java and Go.Python is commonly used in artificial intelligence projects with the help of libraries like TensorFlow, Keras and Scikit-learn. As a scripting language with modular architecture, simple syntax and rich text processing tools, Python is often used for natural language processing.Many operating systems include Python as a standard component. It ships with most Linux distributions, AmigaOS 4, FreeBSD (as a package), NetBSD, OpenBSD (as a package) and macOS and can be used from the command line (terminal). Many Linux distributions use installers written in Python: Ubuntu uses the Ubiquity installer, while Red Hat Linux and Fedora use the Anaconda installer. Gentoo Linux uses Python in its package management system, Portage.\nPython is used extensively in the information security industry, including in exploit development.Most of the Sugar software for the One Laptop per Child XO, now developed at Sugar Labs, is written in Python. The Raspberry Pi single-board computer project has adopted Python as its main user-programming language.\nLibreOffice includes Python, and intends to replace Java with Python. Its Python Scripting Provider is a core feature since Version 4.0 from 7 February 2013.\n\n\n== Languages influenced by Python ==\nPython's design and philosophy have influenced many other programming languages:\n\nBoo uses indentation, a similar syntax, and a similar object model.\nCobra uses indentation and a similar syntax, and its \"Acknowledgements\" document lists Python first among languages that influenced it. However, Cobra directly supports design-by-contract, unit tests, and optional static typing.\nCoffeeScript, a programming language that cross-compiles to JavaScript, has Python-inspired syntax.\nECMAScript borrowed iterators and generators from Python.\nGo is designed for the \"speed of working in a dynamic language like Python\" and shares the same syntax for slicing arrays.\nGroovy was motivated by the desire to bring the Python design philosophy to Java.\nJulia was designed \"with true macros [.. and to be] as usable for general programming as Python [and] should be as fast as C\". Calling to or from Julia is possible; to with PyCall.jl and a Python package pyjulia allows calling, in the other direction, from Python.\nKotlin is a functional programming language with an interactive shell similar to Python. However, Kotlin is strongly typed with access to standard Java libraries.\nRuby's creator, Yukihiro Matsumoto, has said: \"I wanted a scripting language that was more powerful than Perl, and more object-oriented than Python. That's why I decided to design my own language.\"\nSwift, a programming language developed by Apple, has some Python-inspired syntax.\nGDScript, dynamically typed programming language used to create video-games. It is extremely similar to Python with a few minor differences.Python's development practices have also been emulated by other languages. For example, the practice of requiring a document describing the rationale for, and issues surrounding, a change to the language (in Python, a PEP) is also used in Tcl and Erlang.Python received TIOBE's Programming Language of the Year awards in 2007, 2010 and 2018. The award is given to the language with the greatest growth in popularity over the year, as measured by the TIOBE index.\n\n\n== See also ==\n\nPython syntax and semantics\npip (package manager)\nIPython\n\n\n== References ==\n\n\n=== Sources ===\n\"Python for Artificial Intelligence\". Wiki.python.org. 19 July 2012. Archived from the original on 1 November 2012. Retrieved 3 December 2012.\nPaine, Jocelyn, ed. (August 2005). \"AI in Python\". AI Expert Newsletter. Amzi!. Retrieved 11 February 2012.\n\"PyAIML 0.8.5 : Python Package Index\". Pypi.python.org. Retrieved 17 July 2013.\nRussell, Stuart J. & Norvig, Peter (2009). Artificial Intelligence: A Modern Approach (3rd ed.). Upper Saddle River, NJ: Prentice Hall. ISBN 978-0-13-604259-4.\n\n\n== Further reading ==\nDowney, Allen B. (May 2012). Think Python: How to Think Like a Computer Scientist (Version 1.6.6 ed.). ISBN 978-0-521-72596-5.\nHamilton, Naomi (5 August 2008). \"The A-Z of Programming Languages: Python\". Computerworld. Archived from the original on 29 December 2008. Retrieved 31 March 2010.\nLutz, Mark (2013). Learning Python (5th ed.). O'Reilly Media. ISBN 978-0-596-15806-4.\nPilgrim, Mark (2004). Dive Into Python. Apress. ISBN 978-1-59059-356-1.\nPilgrim, Mark (2009). Dive Into Python 3. Apress. ISBN 978-1-4302-2415-0.\nSummerfield, Mark (2009). Programming in Python 3 (2nd ed.). Addison-Wesley Professional. ISBN 978-0-321-68056-3.\n\n\n== External links ==\n\nOfficial website \nPython (programming language) at Curlie"}, {"job": "data scientist", "skill": "SQL language", "keywords": "SQL language", "description": "SQL ( (listen) S-Q-L,  \"sequel\"; Structured Query Language) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.\nSQL offers two main advantages over older read\u2013write APIs such as ISAM or VSAM. Firstly, it introduced the concept of accessing many records with one single command. Secondly, it eliminates the need to specify how to reach a record, e.g. with or without an index.\nOriginally based upon relational algebra and tuple relational calculus, SQL consists of many types of statements, which may be informally classed as sublanguages, commonly: a data query language (DQL), a data definition language (DDL), a data control language (DCL), and a data manipulation language (DML). The scope of SQL includes data query, data manipulation (insert, update and delete), data definition (schema creation and modification), and data access control. Although SQL is essentially a declarative language (4GL), it includes also procedural elements.\nSQL was one of the first commercial languages to utilize Edgar F. Codd\u2019s relational model. The model was described in his influential 1970 paper, \"A Relational Model of Data for Large Shared Data Banks\".  Despite not entirely adhering to the relational model as described by Codd, it became the most widely used database language.SQL became a standard of the American National Standards Institute (ANSI) in 1986, and of the International Organization for Standardization (ISO) in 1987. Since then, the standard has been revised to include a larger set of features. Despite the existence of such standards, most SQL code is not completely portable among different database systems without adjustments.\n\n\n== History ==\nSQL was initially developed at IBM by Donald D. Chamberlin and Raymond F. Boyce after learning about the relational model from Ted Codd in the early 1970s. This version, initially called SEQUEL (Structured English Query Language), was designed to manipulate and retrieve data stored in IBM's original quasi-relational database management system, System R, which a group at IBM San Jose Research Laboratory had developed during the 1970s.Chamberlin and Boyce's first attempt of a relational database language was Square, but it was difficult to use due to subscript notation. After moving to the San Jose Research Laboratory in 1973, they began work on SEQUEL. The acronym SEQUEL was later changed to SQL because \"SEQUEL\" was a trademark of the UK-based Hawker Siddeley Dynamics Engineering Limited company.After testing SQL at customer test sites to determine the usefulness and practicality of the system, IBM began developing commercial products based on their System R prototype including System/38, SQL/DS, and DB2, which were commercially available in 1979, 1981, and 1983, respectively.In the late 1970s, Relational Software, Inc. (now Oracle Corporation) saw the potential of the concepts described by Codd, Chamberlin, and Boyce, and developed their own SQL-based RDMS with aspirations of selling it to the U.S. Navy, Central Intelligence Agency, and other U.S. government agencies. In June 1979, Relational Software, Inc. introduced the first commercially available implementation of SQL, Oracle V2 (Version2) for VAX computers.\nBy 1986, ANSI and ISO standard groups officially adopted the standard \"Database Language SQL\" language definition. New versions of the standard were published in 1989, 1992, 1996, 1999, 2003, 2006, 2008, 2011 and, most recently, 2016.\n\n\n== Design ==\nSQL deviates in several ways from its theoretical foundation, the relational model and its tuple calculus.  In that model, a table is a set of tuples, while in SQL, tables and query results are lists of rows: the same row may occur multiple times, and the order of rows can be employed in queries (e.g. in the LIMIT clause).\nCritics argue that SQL should be replaced with a language that returns strictly to the original foundation: for example, see The Third Manifesto. However, no known proof exists that such uniqueness cannot be added to SQL itself, or at least a variation of SQL. In other words, it's quite possible that SQL can be \"fixed\" or at least improved in this regard such that the industry may not have to switch to a completely different query language to obtain uniqueness. Debate on this remains open.\n\n\n== Syntax ==\n\nThe SQL language is subdivided into several language elements, including:\n\nClauses, which are constituent components of statements and queries. (In some cases, these are optional.)\nExpressions, which can produce either scalar values, or tables consisting of columns and rows of data\nPredicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL) (true/false/unknown) or Boolean truth values and are used to limit the effects of statements and queries, or to change program flow.\nQueries, which retrieve the data based on specific criteria. This is an important element of SQL.\nStatements, which may have a persistent effect on schemata and data, or may control transactions, program flow, connections, sessions, or diagnostics.\nSQL statements also include the semicolon (\";\") statement terminator. Though not required on every platform, it is defined as a standard part of the SQL grammar.\nInsignificant whitespace is generally ignored in SQL statements and queries, making it easier to format SQL code for readability.\n\n\n== Procedural extensions ==\nSQL is designed for a specific purpose: to query data contained in a relational database. SQL is a set-based, declarative programming language, not an imperative programming language like C or BASIC. However, extensions to Standard SQL add procedural programming language functionality, such as control-of-flow constructs. These include:\n\nIn addition to the standard SQL/PSM extensions and proprietary SQL extensions, procedural and object-oriented programmability is available on many SQL platforms via DBMS integration with other languages. The SQL standard defines SQL/JRT extensions (SQL Routines and Types for the Java Programming Language) to support Java code in SQL databases. Microsoft SQL Server 2005 uses the SQLCLR (SQL Server Common Language Runtime) to host managed .NET assemblies in the database, while prior versions of SQL Server were restricted to unmanaged extended stored procedures primarily written in C. PostgreSQL lets users write functions in a wide variety of languages\u2014including Perl, Python, Tcl, JavaScript (PL/V8) and C.\n\n\n== Interoperability and standardization ==\nSQL implementations are incompatible between vendors and do not necessarily completely follow standards. In particular date and time syntax, string concatenation, NULLs, and comparison case sensitivity vary from vendor to vendor. Particular exceptions are PostgreSQL and Mimer SQL which strive for standards compliance, though PostgreSQL does not adhere to the standard in how folding of unquoted names is done. The folding of unquoted names to lower case in PostgreSQL is incompatible with the SQL standard, which says that unquoted names should be folded to upper case. Thus, Foo should be equivalent to FOO not foo according to the standard.\nPopular implementations of SQL commonly omit support for basic features of Standard SQL, such as the DATE or TIME data types.  The most obvious such examples, and incidentally the most popular commercial and proprietary SQL DBMSs, are Oracle (whose DATE behaves as DATETIME, and lacks a TIME type) and MS SQL Server (before the 2008 version). As a result, SQL code can rarely be ported between database systems without modifications.\nThere are several reasons for this lack of portability between database systems:\n\nThe complexity and size of the SQL standard means that most implementors do not support the entire standard.\nThe standard does not specify database behavior in several important areas (e.g. indexes, file storage...), leaving implementations to decide how to behave.\nThe SQL standard precisely specifies the syntax that a conforming database system must implement. However, the standard's specification of the semantics of language constructs is less well-defined, leading to ambiguity.\nMany database vendors have large existing customer bases; where the newer version of the SQL standard conflicts with the prior behavior of the vendor's database, the vendor may be unwilling to break backward compatibility.\nThere is little commercial incentive for vendors to make it easier for users to change database suppliers (see vendor lock-in).\nUsers evaluating database software tend to place other factors such as performance higher in their priorities than standards conformance.SQL was adopted as a standard by the American National Standards Institute (ANSI) in 1986 as SQL-86 and the International Organization for Standardization (ISO) in 1987. It is maintained by ISO/IEC JTC 1, Information technology, Subcommittee SC 32, Data management and interchange. The standard is commonly denoted by the pattern: ISO/IEC 9075-n:yyyy Part n: title, or, as a shortcut, ISO/IEC 9075.\nISO/IEC 9075 is complemented by ISO/IEC 13249: SQL Multimedia and Application Packages (SQL/MM), which defines SQL based interfaces and packages to widely spread applications like video, audio and spatial data.\nUntil 1996, the National Institute of Standards and Technology (NIST) data management standards program certified SQL DBMS compliance with the SQL standard. Vendors now self-certify the compliance of their products.The original standard declared that the official pronunciation for \"SQL\" was an initialism:  (\"ess cue el\"). Regardless, many English-speaking database professionals (including Donald Chamberlin himself) use the acronym-like pronunciation of  (\"sequel\"), mirroring the language's pre-release development name of \"SEQUEL\". The SQL standard has gone through a number of revisions:\n\nInterested parties may purchase SQL standards documents from ISO, IEC or ANSI. A draft of SQL:2008 is freely available as a zip archive.The SQL standard is divided into ten parts.\n\nISO/IEC 9075-1:2016 Part 1: Framework (SQL/Framework). It provides logical concepts.\nISO/IEC 9075-2:2016 Part 2: Foundation (SQL/Foundation). It contains the most central elements of the language and consists of both mandatory and optional features.\nISO/IEC 9075-3:2016 Part 3: Call-Level Interface (SQL/CLI). It defines interfacing components (structures, procedures, variable bindings) that can be used to execute SQL statements from applications written in Ada, C respectively C++, COBOL, Fortran, MUMPS, Pascal or PL/I. (For Java see part 10.) SQL/CLI is defined in such a way that SQL statements and SQL/CLI procedure calls are treated as separate from the calling application's source code. Open Database Connectivity is a well-known superset of SQL/CLI. This part of the standard consists solely of mandatory features.\nISO/IEC 9075-4:2016 Part 4: Persistent stored modules (SQL/PSM). It standardizes procedural extensions for SQL, including flow of control, condition handling, statement condition signals and resignals, cursors and local variables, and assignment of expressions to variables and parameters. In addition, SQL/PSM formalizes declaration and maintenance of persistent database language routines (e.g., \"stored procedures\"). This part of the standard consists solely of optional features.\nISO/IEC 9075-9:2016 Part 9: Management of External Data (SQL/MED). It provides extensions to SQL that define foreign-data wrappers and datalink types to allow SQL to manage external data. External data is data that is accessible to, but not managed by, an SQL-based DBMS. This part of the standard consists solely of optional features.\nISO/IEC 9075-10:2016 Part 10: Object language bindings (SQL/OLB). It defines the syntax and semantics of SQLJ, which is SQL embedded in Java (see also part 3). The standard also describes mechanisms to ensure binary portability of SQLJ applications, and specifies various Java packages and their contained classes. This part of the standard consists solely of optional features. Unlike SQL/OLB JDBC defines an API and is not part of the SQL standard.\nISO/IEC 9075-11:2016 Part 11: Information and definition schemas (SQL/Schemata). It defines the Information Schema and Definition Schema, providing a common set of tools to make SQL databases and objects self-describing. These tools include the SQL object identifier, structure and integrity constraints, security and authorization specifications, features and packages of ISO/IEC 9075, support of features provided by SQL-based DBMS implementations, SQL-based DBMS implementation information and sizing items, and the values supported by the DBMS implementations. This part of the standard contains both mandatory and optional features.\nISO/IEC 9075-13:2016 Part 13: SQL Routines and types using the Java TM programming language (SQL/JRT). It specifies the ability to invoke static Java methods as routines from within SQL applications ('Java-in-the-database'). It also calls for the ability to use Java classes as SQL structured user-defined types. This part of the standard consists solely of optional features.\nISO/IEC 9075-14:2016 Part 14: XML-Related Specifications (SQL/XML). It specifies SQL-based extensions for using XML in conjunction with SQL. The XML data type is introduced, as well as several routines, functions, and XML-to-SQL data type mappings to support manipulation and storage of XML in an SQL database. This part of the standard consists solely of optional features.\nISO/IEC 9075-15:2019 Part 15: Multi-dimensional arrays (SQL/MDA). It specifies a multidimensional array type (MDarray) for SQL, along with operations on MDarrays, MDarray slices, MDarray cells, and related features. This part of the standard consists solely of optional features.ISO/IEC 9075 is complemented by ISO/IEC 13249 SQL Multimedia and Application Packages. This closely related but separate standard is developed by the same committee. It defines interfaces and packages based on SQL. The aim is a unified access to typical database applications like text, pictures, data mining or spatial data.\n\nISO/IEC 13249-1:2016 Part 1: Framework\nISO/IEC 13249-2:2003 Part 2: Full-Text\nISO/IEC 13249-3:2016 Part 3: Spatial\nISO/IEC 13249-5:2003 Part 5: Still image\nISO/IEC 13249-6:2006 Part 6: Data mining\nISO/IEC 13249-7:2013 Part 7: History\nISO/IEC 13249-8:xxxx Part 8: Metadata Registry Access  MRA (work in progress)ISO/IEC 9075 is also accompanied by a series of Technical Reports, published as ISO/IEC TR 19075 in 8 parts. These Technical Reports explain the justification for and usage of some features of SQL, giving examples where appropriate. The Technical Reports are non-normative; if there is any discrepancy from 9075, the text in 9075 holds. Currently available 19075 Technical Reports are:\n\nISO/IEC TR 19075-1:2011 Part 1: XQuery Regular Expression Support in SQL\nISO/IEC TR 19075-2:2015 Part 2: SQL Support for Time-Related Information\nISO/IEC TR 19075-3:2015 Part 3: SQL Embedded in Programs using the JavaTM programming language\nISO/IEC TR 19075-4:2015 Part 4: SQL with Routines and types using the JavaTM programming language\nISO/IEC TR 19075-5:2016 Part 5: Row Pattern Recognition in SQL\nISO/IEC TR 19075-6:2017 Part 6: SQL support for Javascript Object Notation (JSON)\nISO/IEC TR 19075-7:2017 Part 7: Polymorphic table functions in SQL\nISO/IEC TR 19075-8:2019 Part 8: Multi-Dimensional Arrays (SQL/MDA)\n\n\n== Alternatives ==\nA distinction should be made between alternatives to SQL as a language, and alternatives to the relational model itself.  Below are proposed relational alternatives to the SQL language.  See navigational database and NoSQL for alternatives to the relational model.\n\n.QL: object-oriented Datalog\n4D Query Language (4D QL)\nDatalog: critics suggest that Datalog has two advantages over SQL: it has cleaner semantics, which facilitates program understanding and maintenance, and it is more expressive, in particular for recursive queries.\nHTSQL: URL based query method\nIBM Business System 12 (IBM BS12): one of the first fully relational database management systems, introduced in 1982\nISBL\njOOQ: SQL implemented in Java as an internal domain-specific language\nJava Persistence Query Language (JPQL): The query language used by the Java Persistence API and Hibernate persistence library\nJavaScript: MongoDB implements its query language in a JavaScript API.\nLINQ: Runs SQL statements written like language constructs to query collections directly from inside .Net code.\nObject Query Language\nQBE (Query By Example) created by Mosh\u00e8 Zloof, IBM 1977\nQuel introduced in 1974 by the U.C. Berkeley Ingres project.\nTutorial D\nXQuery\n\n\n== Distributed SQL processing ==\nDistributed Relational Database Architecture (DRDA) was designed by a work group within IBM in the period 1988 to 1994. DRDA enables network connected relational databases to cooperate to fulfill SQL requests.An interactive user or program can issue SQL statements to a local RDB and receive tables of data and status indicators in reply from remote RDBs. SQL statements can also be compiled and stored in remote RDBs as packages and then invoked by package name. This is important for the efficient operation of application programs that issue complex, high-frequency queries. It is especially important when the tables to be accessed are located in remote systems.\nThe messages, protocols, and structural components of DRDA are defined by the Distributed Data Management Architecture.\n\n\n== Criticisms ==\nChamberlin's 2012 paper discusses four historical criticisms of SQL:\n\n\n=== Orthogonality and completeness ===\nEarly specifications did not support major features, such as primary keys. Result sets could not be named, and sub-queries had not been defined. These were added in 1992.\n\n\n=== NULLs ===\nSQL's controversial \"NULL\" and three-value logic. Predicates evaluated over nulls return the logical value of \"unknown\" rather than true or false. Features such as outer-join depend on nulls. Null is not equivalent to space. NULL represents no data in the row.\n\n\n=== Duplicates ===\nAnother popular criticism is that it allows duplicate rows, making integration with languages such as Python, whose data types might make it difficult to accurately represent the data, difficult in terms of parsing and by the absence of modularity.This can be avoided declaring a unique constraint with one or more fields that identifies uniquely a row in the table. That constraint could also become the primary key of the table.\n\n\n=== Impedance mismatch ===\nIn a similar sense to Object-relational impedance mismatch, there is a mismatch between the declarative SQL language and the procedural languages that SQL is typically embedded in.\n\n\n== See also ==\nComparison of object-relational database management systems\nComparison of relational database management systems\nD (data language specification)\nD4 (programming language)\nHierarchical model\nList of relational database management systems\nMUMPS\nNoSQL\nQuery by Example\nTransact-SQL\nOnline analytical processing (OLAP)\nOnline transaction processing (OLTP)\nData warehouse\nRelational data stream management system\nStar schema\nSnowflake schema\n\n\n== Notes ==\n\n\n== References ==\n\n\n== Sources ==\n\n\n=== SQL standards documents ===\n\n\n==== ITTF publicly available standards and technical reports ====\nThe ISO/IEC Information Technology Task Force publishes publicly available standards including SQL. Technical Corrigenda (corrections) and Technical Reports (discussion documents) are published there.\nSQL -- Part 1: Framework (SQL/Framework)\n\n\n==== Draft documents ====\nFormal SQL standards are available from ISO and ANSI for a fee. For informative use, as opposed to strict standards compliance, late drafts often suffice.\n\nSQL:2011 draft\nSQL-92 draft\n\n\n== External links ==\n\n1995 SQL Reunion: People, Projects, and Politics, by Paul McJones (ed.): transcript of a reunion meeting devoted to the personal history of relational databases and SQL.\nAmerican National Standards Institute. X3H2 Records, 1978\u20131995 Charles Babbage Institute Collection documents the H2 committee's development of the NDL and SQL standards.\nOral history interview with Donald D. Chamberlin Charles Babbage Institute In this oral history Chamberlin recounts his early life, his education at Harvey Mudd College and Stanford University, and his work on relational database technology. Chamberlin was a member of the System R research team and, with Raymond F. Boyce, developed the SQL database language. Chamberlin also briefly discusses his more recent research on XML query languages.\nComparison of Different SQL Implementations This comparison of various SQL implementations is intended to serve as a guide to those interested in porting SQL code between various RDBMS products, and includes comparisons between SQL:2008, PostgreSQL, DB2, MS SQL Server, MySQL, Oracle, and Informix.\nEvent stream processing with SQL - An introduction to real-time processing of streaming data with continuous SQL queries\nBNF Grammar for ISO/IEC 9075:2003, part 2 SQL/Framework"}, {"job": "data scientist", "skill": "R language", "keywords": "R language", "description": "R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis. Polls, data mining surveys, and studies of scholarly literature databases show substantial increases in popularity; as of November 2019, R ranks 16th in the TIOBE index, a measure of popularity of programming languages.A GNU package, source code for the R software environment is written primarily in C, Fortran, and R itself and is freely available under the GNU General Public License. Pre-compiled binary versions are provided for various operating systems. Although R has a command line interface, there are several graphical user interfaces, such as RStudio, an integrated development environment.\n\n\n== History ==\nR is an implementation of the S programming language combined with lexical scoping semantics, inspired by Scheme. S was created by John Chambers in 1976, while at Bell Labs. There are some important differences, but much of the code written for S runs unaltered.R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team (of which Chambers is a member). R is named partly after the first names of the first two R authors and partly as a play on the name of S. The project was conceived in 1992, with an initial version released in 1995 and a stable beta version in 2000.\n\n\n== Statistical features ==\nR and its libraries implement a wide variety of statistical and graphical techniques, including linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, and others. R is easily extensible through functions and extensions, and the R community is noted for its active contributions in terms of packages. Many of R's standard functions are written in R itself, which makes it easy for users to follow the algorithmic choices made. For computationally intensive tasks, C, C++, and Fortran code can be linked and called at run time. Advanced users can write C, C++, Java, .NET or Python code to manipulate R objects directly. R is highly extensible through the use of user-submitted packages for specific functions or specific areas of study. Due to its S heritage, R has stronger object-oriented programming facilities than most statistical computing languages. Extending R is also eased by its lexical scoping rules.Another strength of R is static graphics, which can produce publication-quality graphs, including mathematical symbols.  Dynamic and interactive graphics are available through additional packages.R has Rd, its own LaTeX-like documentation format, which is used to supply comprehensive documentation, both online in a number of formats and in hard copy.\n\n\n== Programming features ==\nR is an interpreted language; users typically access it through a command-line interpreter. If a user types 2+2 at the R command prompt and presses enter, the computer replies with 4, as shown below:\n\nThis calculation is interpreted as the sum of two single-element vectors, resulting in a single-element vector. The prefix [1] indicates that the list of elements following it on the same line starts with the first element of the vector (a feature that is useful when the output extends over multiple lines).\nLike other similar languages such as APL and MATLAB, R supports matrix arithmetic. R's data structures include vectors, matrices, arrays, data frames (similar to tables in a relational database) and lists. Arrays are stored in column-major order. R's extensible object system includes objects for (among others): regression models, time-series and geo-spatial coordinates. The scalar data type was never a data structure of R. Instead, a scalar is represented as a vector with length one.Many features of R derive from Scheme. R uses S-expressions to represent both data and code. Functions are first-class and can be manipulated in the same way as data objects, facilitating meta-programming, and allow multiple dispatch. Variables in R are lexically scoped and dynamically typed. Function arguments are passed by value, and are lazy\u2014that is to say, they are only evaluated when they are used, not when the function is called.\nR supports procedural programming with functions and, for some functions, object-oriented programming with generic functions. A generic function acts differently depending on the classes of arguments passed to it. In other words, the generic function dispatches the function (method) specific to that  class  of object. For example, R has a generic print function that can print almost every class of object in R with a simple print(objectname) syntax.Although used mainly by statisticians and other practitioners requiring an environment for statistical computation and software development, R can also operate as a general matrix calculation toolbox \u2013 with performance benchmarks comparable to GNU Octave or MATLAB.\n\n\n== Packages ==\nThe capabilities of R are extended through user-created packages, which allow specialised statistical techniques, graphical devices, import/export capabilities, reporting tools (Rmarkdown, knitr, Sweave), etc. These packages are developed primarily in R, and sometimes in Java, C, C++, and Fortran. The R packaging system is also used by researchers to create compendia to organise research data, code and report files in a systematic way for sharing and public archiving.A core set of packages is included with the installation of R, with more than 15,000 additional packages (as of September 2018) available at the Comprehensive R Archive Network (CRAN), Bioconductor, Omegahat, GitHub, and other repositories.The \"Task Views\" page (subject list) on the CRAN website lists a wide range of tasks (in fields such as Finance, Genetics, High Performance Computing, Machine Learning, Medical Imaging, Social Sciences and Spatial Statistics) to which R has been applied and for which packages are available. R has also been identified by the FDA as suitable for interpreting data from clinical research.Other R package resources include Crantastic, a community site for rating and reviewing all CRAN packages, and R-Forge, a central platform for the collaborative development of R packages, R-related software, and projects. R-Forge also hosts many unpublished beta packages, and development versions of CRAN packages.\nThe Bioconductor project provides R packages for the analysis of genomic data. This includes object-oriented data-handling and analysis tools for data from Affymetrix, cDNA microarray, and next-generation high-throughput sequencing methods.\n\n\n== Milestones ==\nA list of changes in R releases is maintained in various \"news\" files at CRAN. Some highlights are listed below for several major releases.\n\n\n== Interfaces ==\nThe most specialized integrated development environment (IDE) for R is RStudio. A similar development interface is R Tools for Visual Studio. Some generic IDEs like Eclipse, also offer features to work with R.\nGraphical user interfaces with more of a point-and-click approach include Rattle GUI, R Commander, and RKWard.\nSome of the more common editors with varying levels of support for R include  Emacs (Emacs Speaks Statistics), Vim (Nvim-R plugin), Neovim (Nvim-R plugin), Kate, LyX, Notepad++, Visual Studio Code, WinEdt, and Tinn-R.R functionality is accessible from several scripting languages such as Python, Perl, Ruby, F#, and Julia. Interfaces to other, high-level programming languages, like Java and .NET C# are available as well.\n\n\n== Implementations ==\nThe main R implementation is written in R, C, and Fortran, and there are several other implementations aimed at improving speed or increasing extensibility. A closely related implementation is pqR (pretty quick R) by Radford M. Neal with improved memory management and support for automatic multithreading. Renjin and FastR are Java implementations of R for use in a Java Virtual Machine.  CXXR, rho, and Riposte are implementations of R in C++.  Renjin, Riposte, and pqR attempt to improve performance by using multiple processor cores and some form of deferred evaluation. Most of these alternative implementations are experimental and incomplete, with relatively few users, compared to the main implementation maintained by the R Development Core Team.\nTIBCO built a runtime engine called TERR, which is part of Spotfire.Microsoft R Open is a fully compatible R distribution with modifications for multi-threaded computations.\n\n\n== Communities ==\nR has local communities worldwide for users to network, share ideas, and learn.There is a growing number of R events bringing its users together, such as conferences (e.g. useR!, WhyR?, conectaR, SatRdays), meetups, as well as R-Ladies groups that promote gender diversity.\n\n\n== useR! conferences ==\nThe official annual gathering of R users is called \"useR!\". The first such event was useR! 2004 in May 2004, Vienna, Austria.  After skipping 2005, the useR! conference has been held annually, usually alternating between locations in Europe and North America. Subsequent conferences have included:\nuseR! 2006, Vienna, Austria\nuseR! 2007, Ames, Iowa, USA\nuseR! 2008, Dortmund, Germany\nuseR! 2009, Rennes, France\nuseR! 2010, Gaithersburg, Maryland, USA\nuseR! 2011, Coventry, United Kingdom\nuseR! 2012, Nashville, Tennessee, USA\nuseR! 2013, Albacete, Spain\nuseR! 2014, Los Angeles, California, USA\nuseR! 2015, Aalborg, Denmark\nuseR! 2016, Stanford, California, USA\nuseR! 2017, Brussels, Belgium\nuseR! 2018, Brisbane, Australia\nuseR! 2019, Toulouse, FranceFuture conferences planned are as follows:\nuseR! 2020, St. Louis, Missouri, USA\n\n\n== The R Journal ==\nThe R Journal is the open access, refereed journal of the R project for statistical computing. It features short to medium length articles on the use and development of R, including packages, programming tips, CRAN news, and foundation news.\n\n\n== Comparison with SAS, SPSS, and Stata ==\nR is comparable to popular commercial statistical packages such as SAS, SPSS, and Stata, but R is available to users at no charge under a free software license.In January 2009, the New York Times ran an article charting the growth of R, the reasons for its popularity among data scientists and the threat it poses to commercial statistical packages such as SAS.  In June 2017 data scientist Robert Muenchen published a more in-depth comparison between R and other software packages, \"The Popularity of Data Science Software\".R is more procedural-code oriented than either SAS or SPSS, both of which make heavy use of pre-programmed procedures (called \"procs\") that are built-in to the language environment and customized by parameters of each call.  R generally processes data in-memory, which limits its usefulness in processing extremely large files.\n\n\n== Commercial support for R ==\n \nAlthough R is an open-source project supported by the community developing it, some companies strive to provide commercial support and/or extensions for their customers. This section gives some examples of such companies.\nIn 2007, Richard Schultz, Martin Schultz, Steve Weston and Kirk Mettler founded Revolution Analytics to provide commercial support for Revolution R, their distribution of R, which also includes components developed by the company. Major additional components include: ParallelR, the R Productivity Environment IDE, RevoScaleR (for big data analysis), RevoDeployR, web services framework, and the ability for reading and writing data in the SAS file format. Revolution Analytics also offer a distribution of R designed to comply with established IQ/OQ/PQ criteria which enables clients in the pharmaceutical sector to validate their installation of REvolution R. In 2015, Microsoft Corporation completed the acquisition of Revolution Analytics. and has since integrated the R programming language into SQL Server 2016, SQL Server 2017, Power BI, Azure SQL Database, Azure Cortana Intelligence, Microsoft R Server and Visual Studio 2017.In October 2011, Oracle announced the Big Data Appliance, which integrates R, Apache Hadoop, Oracle Linux, and a NoSQL database with Exadata hardware. As of 2012, Oracle R Enterprise became one of two components of the \"Oracle Advanced Analytics Option\" (alongside Oracle Data Mining).IBM offers support for in-Hadoop execution of R, and provides a programming model for massively parallel in-database analytics in R.Tibco offers a runtime-version R as a part of Spotfire.Mango Solutions offers a validation package for R, ValidR, to make it compliant with drug approval agencies, like FDA. These agencies allow for the use of any statistical software in submissions, if only the software is validated, either by the vendor or sponsor itself.\n\n\n== Examples ==\n\n\n=== Basic syntax ===\nThe following examples illustrate the basic syntax of the language and use of the command-line interface.\nIn R, the generally preferred assignment operator is an arrow made from two characters <-, although = can usually be used instead.\n\n\n=== Structure of a function ===\nOne of R\u2019s strengths is the ease of creating new functions. Objects in the function body remain local to the function, and any data type may be returned.\nHere is an example user-created function:\n\n\n=== Mandelbrot set ===\nShort R code calculating Mandelbrot set through the first 20 iterations of equation z = z2 + c plotted for different complex constants c.  This example demonstrates:\n\nuse of community-developed external libraries (called packages), in this case caTools package\nhandling of complex numbers\nmultidimensional arrays of numbers used as basic data type, see variables C, Z and X.\n\n\n== See also ==\n\nComparison of numerical analysis software\nComparison of statistical packages\nList of numerical analysis software\nList of statistical packages\nRmetrics\nRStudio\nStatcheck\n\n\n== References ==\n\n\n== External links ==\n\nOfficial website  of the R project"}, {"job": "data scientist", "skill": "supervised learning", "keywords": "supervised learning", "description": "Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples.  In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).  A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a \"reasonable\" way (see inductive bias).\nThe parallel task in human and animal psychology is often referred to as concept learning.\n\n\n== Steps ==\nIn order to solve a given problem of supervised learning, one has to perform the following steps:\n\nDetermine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set. In the case of handwriting analysis, for example, this might be a single handwritten character, an entire handwritten word, or an entire line of handwriting.\nGather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.\nDetermine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.\nDetermine the structure of the learned function and corresponding learning algorithm. For example, the engineer may choose to use support vector machines or decision trees.\nComplete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.\nEvaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.\n\n\n== Algorithm choice ==\nA wide range of supervised learning algorithms are available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems (see the No free lunch theorem).\nThere are four major issues to consider in supervised learning:\n\n\n=== Bias-variance tradeoff ===\n\nA first issue is the tradeoff between bias and variance.  Imagine that we have available several different, but equally good, training data sets.  A learning algorithm is biased for a particular input \n  \n    \n      \n        x\n      \n    \n    {\\displaystyle x}\n   if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for \n  \n    \n      \n        x\n      \n    \n    {\\displaystyle x}\n  .  A learning algorithm has high variance for a particular input \n  \n    \n      \n        x\n      \n    \n    {\\displaystyle x}\n   if it predicts different output values when trained on different training sets.  The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm.  Generally, there is a tradeoff between bias and variance.  A learning algorithm with low bias must be \"flexible\" so that it can fit the data well.  But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance.  A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).\n\n\n=== Function complexity and amount of training data ===\nThe second issue is the amount of training data available relative to the complexity of the \"true\" function (classifier or regression function).  If the true function is simple, then an \"inflexible\" learning algorithm with high bias and low variance will be able to learn it from a small amount of data.  But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be able to learn from a very large amount of training data and using a \"flexible\" learning algorithm with low bias and high variance.\n\n\n=== Dimensionality of the input space ===\nA third issue is the dimensionality of the input space.  If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features.  This is because the many \"extra\" dimensions can confuse the learning algorithm and cause it to have high variance.  Hence, high input dimensional typically requires tuning the classifier to have low variance and high bias.  In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function.  In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones.  This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.\n\n\n=== Noise in the output values ===\nA fourth issue is the degree of noise in the desired output values (the supervisory target variables).  If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples.  Attempting to fit the data too carefully leads to overfitting.  You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled \"corrupts\" your training data - this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.\nIn practice, there are several approaches to alleviate noise in the output values such as early stopping to prevent overfitting as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm.  There are several algorithms that identify noisy training examples and removing the suspected noisy training examples prior to training has decreased generalization error with statistical significance.\n\n\n=== Other factors to consider (important) ===\nOther factors to consider when choosing and applying a learning algorithm include the following:\n\nHeterogeneity of the data.  If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others.  Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval).  Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.\nRedundancy in the data.  If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities.  These problems can often be solved by imposing some form of regularization.\nPresence of interactions and non-linearities.  If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well.  However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions.  Linear methods can also be applied, but the engineer must manually specify the interactions when using them.When considering a new application, the engineer can compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross validation).  Tuning the performance of a learning algorithm can be very time-consuming.  Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.\n\n\n=== Algorithms ===\nThe most widely used learning algorithms are: \n\nSupport Vector Machines\nlinear regression\nlogistic regression\nnaive Bayes\nlinear discriminant analysis\ndecision trees\nk-nearest neighbor algorithm\nNeural Networks (Multilayer perceptron)\nSimilarity learning\n\n\n== How supervised learning algorithms work ==\nGiven a set of \n  \n    \n      \n        N\n      \n    \n    {\\displaystyle N}\n   training examples of the form \n  \n    \n      \n        {\n        (\n        \n          x\n          \n            1\n          \n        \n        ,\n        \n          y\n          \n            1\n          \n        \n        )\n        ,\n        .\n        .\n        .\n        ,\n        (\n        \n          x\n          \n            N\n          \n        \n        ,\n        \n        \n          y\n          \n            N\n          \n        \n        )\n        }\n      \n    \n    {\\displaystyle \\{(x_{1},y_{1}),...,(x_{N},\\;y_{N})\\}}\n   such that \n  \n    \n      \n        \n          x\n          \n            i\n          \n        \n      \n    \n    {\\displaystyle x_{i}}\n   is the feature vector of the i-th example and \n  \n    \n      \n        \n          y\n          \n            i\n          \n        \n      \n    \n    {\\displaystyle y_{i}}\n   is its label (i.e., class), a learning algorithm seeks a function \n  \n    \n      \n        g\n        :\n        X\n        \u2192\n        Y\n      \n    \n    {\\displaystyle g:X\\to Y}\n  , where \n  \n    \n      \n        X\n      \n    \n    {\\displaystyle X}\n   is the input space and\n\n  \n    \n      \n        Y\n      \n    \n    {\\displaystyle Y}\n   is the output space.  The function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   is an element of some space of possible functions \n  \n    \n      \n        G\n      \n    \n    {\\displaystyle G}\n  , usually called the hypothesis space.  It is sometimes convenient to\nrepresent \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   using a scoring function \n  \n    \n      \n        f\n        :\n        X\n        \u00d7\n        Y\n        \u2192\n        \n          R\n        \n      \n    \n    {\\displaystyle f:X\\times Y\\to \\mathbb {R} }\n   such that \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   is defined as returning the \n  \n    \n      \n        y\n      \n    \n    {\\displaystyle y}\n   value that gives the highest score: \n  \n    \n      \n        g\n        (\n        x\n        )\n        =\n        \n          \n            \n              arg\n              \u2061\n              max\n            \n            y\n          \n        \n        \n        f\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle g(x)={\\underset {y}{\\arg \\max }}\\;f(x,y)}\n  .  Let \n  \n    \n      \n        F\n      \n    \n    {\\displaystyle F}\n   denote the space of scoring functions.\nAlthough \n  \n    \n      \n        G\n      \n    \n    {\\displaystyle G}\n   and \n  \n    \n      \n        F\n      \n    \n    {\\displaystyle F}\n   can be any space of functions, many learning algorithms are probabilistic models where \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   takes the form of a conditional probability model \n  \n    \n      \n        g\n        (\n        x\n        )\n        =\n        P\n        (\n        y\n        \n          |\n        \n        x\n        )\n      \n    \n    {\\displaystyle g(x)=P(y|x)}\n  , or \n  \n    \n      \n        f\n      \n    \n    {\\displaystyle f}\n   takes the form of a joint probability model \n  \n    \n      \n        f\n        (\n        x\n        ,\n        y\n        )\n        =\n        P\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle f(x,y)=P(x,y)}\n  .  For example, naive Bayes and linear discriminant analysis are joint probability models, whereas logistic regression is a conditional probability model.\nThere are two basic approaches to choosing \n  \n    \n      \n        f\n      \n    \n    {\\displaystyle f}\n   or \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n  : empirical risk minimization and structural risk minimization.  Empirical risk minimization seeks the function that best fits the training data.  Structural risk minimization includes a penalty function that controls the bias/variance tradeoff.\nIn both cases, it is assumed that the training set consists of a sample of independent and identically distributed pairs, \n  \n    \n      \n        (\n        \n          x\n          \n            i\n          \n        \n        ,\n        \n        \n          y\n          \n            i\n          \n        \n        )\n      \n    \n    {\\displaystyle (x_{i},\\;y_{i})}\n  .  In order to measure how well a function fits the training data, a loss function \n  \n    \n      \n        L\n        :\n        Y\n        \u00d7\n        Y\n        \u2192\n        \n          \n            R\n          \n          \n            \u2265\n            0\n          \n        \n      \n    \n    {\\displaystyle L:Y\\times Y\\to \\mathbb {R} ^{\\geq 0}}\n   is defined.  For training example \n  \n    \n      \n        (\n        \n          x\n          \n            i\n          \n        \n        ,\n        \n        \n          y\n          \n            i\n          \n        \n        )\n      \n    \n    {\\displaystyle (x_{i},\\;y_{i})}\n  , the loss of predicting the value \n  \n    \n      \n        \n          \n            \n              y\n              ^\n            \n          \n        \n      \n    \n    {\\displaystyle {\\hat {y}}}\n   is \n  \n    \n      \n        L\n        (\n        \n          y\n          \n            i\n          \n        \n        ,\n        \n          \n            \n              y\n              ^\n            \n          \n        \n        )\n      \n    \n    {\\displaystyle L(y_{i},{\\hat {y}})}\n  .\nThe risk \n  \n    \n      \n        R\n        (\n        g\n        )\n      \n    \n    {\\displaystyle R(g)}\n   of function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   is defined as the expected loss of \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n  .  This can be estimated from the training data as\n\n  \n    \n      \n        \n          R\n          \n            e\n            m\n            p\n          \n        \n        (\n        g\n        )\n        =\n        \n          \n            1\n            N\n          \n        \n        \n          \u2211\n          \n            i\n          \n        \n        L\n        (\n        \n          y\n          \n            i\n          \n        \n        ,\n        g\n        (\n        \n          x\n          \n            i\n          \n        \n        )\n        )\n      \n    \n    {\\displaystyle R_{emp}(g)={\\frac {1}{N}}\\sum _{i}L(y_{i},g(x_{i}))}\n  .\n\n\n=== Empirical risk minimization ===\n\nIn empirical risk minimization, the supervised learning algorithm seeks the function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   that minimizes \n  \n    \n      \n        R\n        (\n        g\n        )\n      \n    \n    {\\displaystyle R(g)}\n  .  Hence, a supervised learning algorithm can be constructed by applying an optimization algorithm to find \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n  .\nWhen \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   is a conditional probability distribution \n  \n    \n      \n        P\n        (\n        y\n        \n          |\n        \n        x\n        )\n      \n    \n    {\\displaystyle P(y|x)}\n   and the loss function is the negative log likelihood: \n  \n    \n      \n        L\n        (\n        y\n        ,\n        \n          \n            \n              y\n              ^\n            \n          \n        \n        )\n        =\n        \u2212\n        log\n        \u2061\n        P\n        (\n        y\n        \n          |\n        \n        x\n        )\n      \n    \n    {\\displaystyle L(y,{\\hat {y}})=-\\log P(y|x)}\n  , then empirical risk minimization is equivalent to maximum likelihood estimation.\nWhen \n  \n    \n      \n        G\n      \n    \n    {\\displaystyle G}\n   contains many candidate functions or the training set is not sufficiently large, empirical risk minimization leads to high variance and poor generalization.  The learning algorithm is able\nto memorize the training examples without generalizing well.  This is called overfitting.\n\n\n=== Structural risk minimization ===\nStructural risk minimization seeks to prevent overfitting by incorporating a regularization penalty into the optimization.  The regularization penalty can be viewed as implementing a form of Occam's razor that prefers simpler functions over more complex ones.\nA wide variety of penalties have been employed that correspond to different definitions of complexity.  For example, consider the case where the function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   is a linear function of the form\n\n  \n    \n      \n        g\n        (\n        x\n        )\n        =\n        \n          \u2211\n          \n            j\n            =\n            1\n          \n          \n            d\n          \n        \n        \n          \u03b2\n          \n            j\n          \n        \n        \n          x\n          \n            j\n          \n        \n      \n    \n    {\\displaystyle g(x)=\\sum _{j=1}^{d}\\beta _{j}x_{j}}\n  .A popular regularization penalty is \n  \n    \n      \n        \n          \u2211\n          \n            j\n          \n        \n        \n          \u03b2\n          \n            j\n          \n          \n            2\n          \n        \n      \n    \n    {\\displaystyle \\sum _{j}\\beta _{j}^{2}}\n  , which is the squared Euclidean norm of the weights, also known as the \n  \n    \n      \n        \n          L\n          \n            2\n          \n        \n      \n    \n    {\\displaystyle L_{2}}\n   norm.  Other norms include the \n  \n    \n      \n        \n          L\n          \n            1\n          \n        \n      \n    \n    {\\displaystyle L_{1}}\n   norm, \n  \n    \n      \n        \n          \u2211\n          \n            j\n          \n        \n        \n          |\n        \n        \n          \u03b2\n          \n            j\n          \n        \n        \n          |\n        \n      \n    \n    {\\displaystyle \\sum _{j}|\\beta _{j}|}\n  , and the \n  \n    \n      \n        \n          L\n          \n            0\n          \n        \n      \n    \n    {\\displaystyle L_{0}}\n   norm, which is the number of non-zero  \n  \n    \n      \n        \n          \u03b2\n          \n            j\n          \n        \n      \n    \n    {\\displaystyle \\beta _{j}}\n  s.  The penalty will be denoted by \n  \n    \n      \n        C\n        (\n        g\n        )\n      \n    \n    {\\displaystyle C(g)}\n  .\nThe supervised learning optimization problem is to find the function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   that minimizes\n\n  \n    \n      \n        J\n        (\n        g\n        )\n        =\n        \n          R\n          \n            e\n            m\n            p\n          \n        \n        (\n        g\n        )\n        +\n        \u03bb\n        C\n        (\n        g\n        )\n        .\n      \n    \n    {\\displaystyle J(g)=R_{emp}(g)+\\lambda C(g).}\n  The parameter \n  \n    \n      \n        \u03bb\n      \n    \n    {\\displaystyle \\lambda }\n   controls the bias-variance tradeoff.  When \n  \n    \n      \n        \u03bb\n        =\n        0\n      \n    \n    {\\displaystyle \\lambda =0}\n  , this gives empirical risk minimization with low bias and high variance.  When \n  \n    \n      \n        \u03bb\n      \n    \n    {\\displaystyle \\lambda }\n   is large, the learning algorithm will have high bias and low variance.  The value of \n  \n    \n      \n        \u03bb\n      \n    \n    {\\displaystyle \\lambda }\n   can be chosen empirically via cross validation.\nThe complexity penalty has a Bayesian interpretation as the negative log prior probability of \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n  , \n  \n    \n      \n        \u2212\n        log\n        \u2061\n        P\n        (\n        g\n        )\n      \n    \n    {\\displaystyle -\\log P(g)}\n  , in which case \n  \n    \n      \n        J\n        (\n        g\n        )\n      \n    \n    {\\displaystyle J(g)}\n   is the posterior probabability of \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n  .\n\n\n== Generative training ==\nThe training methods described above are discriminative training methods, because they seek to find a function \n  \n    \n      \n        g\n      \n    \n    {\\displaystyle g}\n   that discriminates well between the different output values (see discriminative model).  For the special case where \n  \n    \n      \n        f\n        (\n        x\n        ,\n        y\n        )\n        =\n        P\n        (\n        x\n        ,\n        y\n        )\n      \n    \n    {\\displaystyle f(x,y)=P(x,y)}\n   is a joint probability distribution and the loss function is the negative log likelihood \n  \n    \n      \n        \u2212\n        \n          \u2211\n          \n            i\n          \n        \n        log\n        \u2061\n        P\n        (\n        \n          x\n          \n            i\n          \n        \n        ,\n        \n          y\n          \n            i\n          \n        \n        )\n        ,\n      \n    \n    {\\displaystyle -\\sum _{i}\\log P(x_{i},y_{i}),}\n   a risk minimization algorithm is said to perform generative training, because \n  \n    \n      \n        f\n      \n    \n    {\\displaystyle f}\n   can be regarded as a generative model that explains how the data were generated.  Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms.  In some cases, the solution can be computed in closed form as in naive Bayes and linear discriminant analysis.\n\n\n== Generalizations ==\nThere are several ways in which the standard supervised learning problem can be generalized:\n\nSemi-supervised learning: In this setting, the desired output values are provided only for a subset of the training data.  The remaining data is unlabeled.\nWeak supervision: In this setting, noisy, limited, or imprecise sources are used to provide supervision signal for labeling training data.\nActive learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user.  Often, the queries are based on unlabeled data, which is a scenario that combines semi-supervised learning with active learning.\nStructured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, then standard methods must be extended.\nLearning to rank: When the input is a set of objects and the desired output is a ranking of those objects, then again the standard methods must be extended.\n\n\n== Approaches and algorithms ==\nAnalytical learning\nArtificial neural network\nBackpropagation\nBoosting (meta-algorithm)\nBayesian statistics\nCase-based reasoning\nDecision tree learning\nInductive logic programming\nGaussian process regression\nGenetic Programming\nGroup method of data handling\nKernel estimators\nLearning Automata\nLearning Classifier Systems\nMinimum message length (decision trees, decision graphs, etc.)\nMultilinear subspace learning\nNaive Bayes classifier\nMaximum entropy classifier\nConditional random field\nNearest Neighbor Algorithm\nProbably approximately correct learning (PAC) learning\nRipple down rules, a knowledge acquisition methodology\nSymbolic machine learning algorithms\nSubsymbolic machine learning algorithms\nSupport vector machines\nMinimum Complexity Machines (MCM)\nRandom Forests\nEnsembles of Classifiers\nOrdinal classification\nData Pre-processing\nHandling imbalanced datasets\nStatistical relational learning\nProaftn, a multicriteria classification algorithm\n\n\n== Applications ==\nBioinformatics\nCheminformatics\nQuantitative structure\u2013activity relationship\nDatabase marketing\nHandwriting recognition\nInformation retrieval\nLearning to rank\nInformation extraction\nObject recognition in computer vision\nOptical character recognition\nSpam detection\nPattern recognition\nSpeech recognition\nSupervised learning is a special case of Downward causation in biological systems\n\n\n== General issues ==\nComputational learning theory\nInductive bias\nOverfitting (machine learning)\n(Uncalibrated) Class membership probabilities\nUnsupervised learning\nVersion spaces\n\n\n== See also ==\nList of datasets for machine learning research\n\n\n== References ==\n\n\n== External links ==\nMachine Learning Open Source Software (MLOSS)"}, {"job": "data scientist", "skill": "unsupervised learning", "keywords": "unsupervised learning", "description": "Unsupervised learning is a type of self-organized Hebbian learning that helps find previously unknown patterns in data set without pre-existing labels. It is also known as self-organization and allows modeling probability densities of given inputs. It is one of the main three categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning has also been described, and is a hybridization of supervised and unsupervised techniques. \nTwo of the main methods used in unsupervised learning are principal component and cluster analysis. Cluster analysis is used in unsupervised learning to group, or segment, datasets with shared attributes in order to extrapolate algorithmic relationships.  Cluster analysis is a branch of machine learning that groups the data that has not been labelled, classified or categorized. Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data. This approach helps detect anomalous data points that do not fit into either group.\nA central application of unsupervised learning is in the field of density estimation in statistics, though unsupervised learning encompasses many other domains involving summarizing and explaining data features.  It could be contrasted with supervised learning by saying that whereas supervised learning intends to infer a conditional probability distribution \n  \n    \n      \n        \n          p\n          \n            X\n          \n        \n        (\n        x\n        \n        \n          |\n        \n        \n        y\n        )\n      \n    \n    {\\textstyle p_{X}(x\\,|\\,y)}\n   conditioned on the label \n  \n    \n      \n        y\n      \n    \n    {\\textstyle y}\n   of input data; unsupervised learning intends to infer an a priori probability distribution \n  \n    \n      \n        \n          p\n          \n            X\n          \n        \n        (\n        x\n        )\n      \n    \n    {\\textstyle p_{X}(x)}\n  .\nGenerative adversarial networks can also be used with unsupervised learning, though they can also be applied to supervised and reinforcement techniques.\n\n\n== Approaches ==\nSome of the most common algorithms used in unsupervised learning include:\n\nClustering\nhierarchical clustering,\nk-means\nmixture models\nDBSCAN\nOPTICS algorithm\nAnomaly detection\nLocal Outlier Factor\nNeural Networks\nAutoencoders\nDeep Belief Nets\nHebbian Learning\nGenerative adversarial networks\nSelf-organizing map\nApproaches for learning latent variable models such as\nExpectation\u2013maximization algorithm (EM)\nMethod of moments\nBlind signal separation techniques\nPrincipal component analysis\nIndependent component analysis\nNon-negative matrix factorization\nSingular value decomposition\n\n\n== Neural networks ==\nThe classical example of unsupervised learning in the study of neural networks is Donald Hebb's principle, that is, neurons that fire together wire together. In Hebbian learning, the connection is reinforced irrespective of an error, but is exclusively a function of the coincidence between action potentials between the two neurons. A similar version that modifies synaptic weights takes into account the time between the action potentials (spike-timing-dependent plasticity or STDP). Hebbian Learning has been hypothesized to underlie a range of cognitive functions, such as pattern recognition and experiential learning.\nAmong neural network models, the self-organizing map (SOM) and adaptive resonance theory (ART) are commonly used in unsupervised learning algorithms. The SOM is a topographic organization in which nearby locations in the map represent inputs with similar properties. The ART model allows the number of clusters to vary with problem size and lets the user control the degree of similarity between members of the same clusters by means of a user-defined constant called the vigilance parameter. ART networks are used for many pattern recognition tasks, such as automatic target recognition and seismic signal processing.\n\n\n== Method of moments ==\nOne of the statistical approaches for unsupervised learning is the method of moments. In the method of moments, the unknown parameters (of interest) in the model are related to the moments of one or more random variables, and thus, these unknown parameters can be estimated given the moments. The moments are usually estimated from samples empirically. The basic moments are first and second order moments. For a random vector, the first order moment is the mean vector, and the second order moment is the covariance matrix (when the mean is zero). Higher order moments are usually represented using tensors which are the generalization of matrices to higher orders as multi-dimensional arrays.\nIn particular, the method of moments is shown to be effective in learning the parameters of latent variable models.\nLatent variable models are statistical models where in addition to the observed variables, a set of latent variables also exists which is not observed. A highly practical example of latent variable models in machine learning is the topic modeling which is a statistical model for generating the words (observed variables) in the document based on the topic (latent variable) of the document. In the topic modeling, the words in the document are generated according to different statistical parameters when the topic of the document is changed. It is shown that method of moments (tensor decomposition techniques) consistently recover the parameters of a large class of latent variable models under some assumptions.The Expectation\u2013maximization algorithm (EM) is also one of the most practical methods for learning latent variable models. However, it can get stuck in local optima, and it is not guaranteed that the algorithm will converge to the true unknown parameters of the model. In contrast, for the method of moments, the global convergence is guaranteed under some conditions.\n\n\n== See also ==\nAutomated machine learning\nCluster analysis\nAnomaly detection\nExpectation\u2013maximization algorithm\nGenerative topographic map\nMeta-learning (computer science)\nMultivariate analysis\nRadial basis function network\nWeak supervision\n\n\n== Notes ==\n\n\n== Further reading ==\nBousquet, O.; von Luxburg, U.; Raetsch, G., eds. (2004). Advanced Lectures on Machine Learning. Springer-Verlag. ISBN 978-3540231226.\nDuda, Richard O.; Hart, Peter E.; Stork, David G. (2001). \"Unsupervised Learning and Clustering\". Pattern classification (2nd ed.). Wiley. ISBN 0-471-05669-3.\nHastie, Trevor; Tibshirani, Robert (2009). The Elements of Statistical Learning: Data mining, Inference, and Prediction. New York: Springer. pp. 485\u2013586. doi:10.1007/978-0-387-84858-7_14. ISBN 978-0-387-84857-0.\nHinton, Geoffrey; Sejnowski, Terrence J., eds. (1999). Unsupervised Learning: Foundations of Neural Computation. MIT Press. ISBN 0-262-58168-X. (This book focuses on unsupervised learning in neural networks)"}, {"job": "data scientist", "skill": "neural network", "keywords": "neural network", "description": "A neural network is a network or circuit of neurons, or in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus a neural network is either a biological neural network, made up of real biological neurons, or an artificial neural network, for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled as weights. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be \u22121 and 1.\nThese artificial networks may be used for predictive modeling, adaptive control and applications where they can be trained via a dataset. Self-learning resulting from experience can occur within networks, which can derive conclusions from a complex and seemingly unrelated set of information.\n\n\n== Overview ==\nA biological neural network is composed of a group or groups of chemically connected or functionally associated neurons. A single neuron may be connected to many other neurons and the total number of neurons and connections in a network may be extensive. Connections, called synapses, are usually formed from axons to dendrites, though dendrodendritic synapses and other connections are possible. Apart from the electrical signaling, there are other forms of signaling that arise from neurotransmitter diffusion.\nArtificial intelligence, cognitive modeling, and neural networks are information processing paradigms inspired by the way biological neural systems process data. Artificial intelligence and cognitive modeling try to simulate some properties of biological neural networks. In the artificial intelligence field, artificial neural networks have been applied successfully to speech recognition, image analysis and adaptive control, in order to construct software agents (in computer and video games) or autonomous robots.\nHistorically, digital computers evolved from the von Neumann model, and operate via the execution of explicit instructions via access to memory by a number of processors. On the other hand, the origins of neural networks are based on efforts to model information processing in biological systems. Unlike the von Neumann model, neural network computing does not separate memory and processing.\nNeural network theory has served both to better identify how the neurons in the brain function and to provide the basis for efforts to create artificial intelligence.\n\n\n== History ==\nThe preliminary theoretical base for contemporary neural networks was independently proposed by Alexander Bain (1873) and William James (1890). In their work, both thoughts and body activity resulted from interactions among neurons within the brain.\n\nFor Bain, every activity led to the firing of a certain set of neurons. When activities were repeated, the connections between those neurons strengthened. According to his theory, this repetition was what led to the formation of memory. The general scientific community at the time was skeptical of Bain's theory because it required what appeared to be an inordinate number of neural connections within the brain. It is now apparent that the brain is exceedingly complex and that the same brain \u201cwiring\u201d can handle multiple problems and inputs.\nJames's theory was similar to Bain's, however, he suggested that memories and actions resulted from electrical currents flowing among the neurons in the brain. His model, by focusing on the flow of electrical currents, did not require individual neural connections for each memory or action.\nC. S. Sherrington (1898) conducted experiments to test James's theory. He ran electrical currents down the spinal cords of rats. However, instead of demonstrating an increase in electrical current as projected by James, Sherrington found that the electrical current strength decreased as the testing continued over time. Importantly, this work led to the discovery of the concept of habituation. \nMcCulloch and Pitts  (1943) created a computational model for neural networks based on mathematics and algorithms. They called this model threshold logic. The model paved the way for neural network research to split into two distinct approaches. One approach focused on biological processes in the brain and the other focused on the application of neural networks to artificial intelligence.\nIn the late 1940s psychologist Donald Hebb  created a hypothesis of learning based on the mechanism of neural plasticity that is now known as Hebbian learning. Hebbian learning is considered to be a 'typical' unsupervised learning rule and its later variants were early models for long term potentiation. These ideas started being applied to computational models in 1948 with Turing's B-type machines.\nFarley and Clark (1954) first used computational machines, then called calculators, to simulate a Hebbian network at MIT. Other neural network computational machines were created by Rochester, Holland, Habit, and Duda (1956).\nRosenblatt (1958) created the perceptron, an algorithm for pattern recognition based on a two-layer learning computer network using simple addition and subtraction. With mathematical notation, Rosenblatt also described circuitry not in the basic perceptron, such as the exclusive-or circuit, a circuit whose mathematical computation could not be processed until after the backpropagation algorithm was created by Werbos (1975).\nNeural network research stagnated after the publication of machine learning research by Marvin Minsky and Seymour Papert (1969). They discovered two key issues with the computational machines that processed neural networks. The first issue was that single-layer neural networks were incapable of processing the exclusive-or circuit. The second significant issue was that computers were not sophisticated enough to effectively handle the long run time required by large neural networks. Neural network research slowed until computers achieved greater processing power. Also key in later advances was the backpropagation algorithm which effectively solved the exclusive-or problem (Werbos 1975).The parallel distributed processing of the mid-1980s became popular under the name connectionism. The text by Rumelhart and McClelland (1986) provided a full exposition on the use of connectionism in computers to simulate neural processes.\nNeural networks, as used in artificial intelligence, have traditionally been viewed as simplified models of neural processing in the brain, even though the relation between this model and brain biological architecture is debated, as it is not clear to what degree artificial neural networks mirror brain function.\n\n\n== Artificial intelligence ==\n\nA neural network (NN), in the case of artificial neurons called artificial neural network (ANN) or simulated neural network (SNN), is an interconnected group of natural or artificial neurons that uses a mathematical or computational model for information processing based on a connectionistic approach to computation. In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network.\nIn more practical terms neural networks are non-linear statistical data modeling or decision making tools. They can be used to model complex relationships between inputs and outputs or to find patterns in data.\nAn artificial neural network involves a network of simple processing elements (artificial neurons) which can exhibit complex global behavior, determined by the connections between the processing elements and element parameters. Artificial neurons were first proposed in 1943 by Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, who first collaborated at the University of Chicago.One classical type of artificial neural network is the recurrent Hopfield network.\nThe concept of a neural network appears to have first been proposed by Alan Turing in his 1948 paper Intelligent Machinery in which called them \"B-type unorganised machines\".The utility of artificial neural network models lies in the fact that they can be used to infer a function from observations and also to use it. Unsupervised neural networks can also be used to learn representations of the input that capture the salient characteristics of the input distribution, e.g., see the Boltzmann machine (1983), and more recently, deep learning algorithms, which can implicitly learn the distribution function of the observed data. Learning in neural networks is particularly useful in applications where the complexity of the data or task makes the design of such functions by hand impractical.\n\n\n== Applications ==\nNeural networks can be used in different fields. The tasks to which artificial neural networks are applied tend to fall within the following broad categories:\n\nFunction approximation, or regression analysis, including time series prediction and modeling.\nClassification, including pattern and sequence recognition, novelty detection and sequential decision making.\nData processing, including filtering, clustering, blind signal separation and compression.Application areas of ANNs include nonlinear system identification and control (vehicle control, process control), game-playing and decision making (backgammon, chess, racing), pattern recognition (radar systems, face identification, object recognition), sequence recognition (gesture, speech, handwritten text recognition), medical diagnosis, financial applications, data mining (or knowledge discovery in databases, \"KDD\"), visualization and e-mail spam filtering. For example, it is possible to create a semantic profile of user's interests emerging from pictures trained for object recognition.\n\n\n== Neuroscience ==\nTheoretical and computational neuroscience is the field concerned with the theoretical analysis and computational modeling of biological neural systems.\nSince neural systems are intimately related to cognitive processes and behaviour, the field is closely related to cognitive and behavioural modeling.\nThe aim of the field is to create models of biological neural systems in order to understand how biological systems work. To gain this understanding, neuroscientists strive to make a link between observed biological processes (data), biologically plausible mechanisms for neural processing and learning (biological neural network models) and theory (statistical learning theory and information theory).\n\n\n=== Types of models ===\nMany models are used; defined at different levels of abstraction, and modeling different aspects of neural systems. They range from models of the short-term behaviour of individual neurons, through models of the dynamics of neural circuitry arising from interactions between individual neurons, to models of behaviour arising from abstract neural modules that represent complete subsystems. These include models of the long-term and short-term plasticity of neural systems and its relation to learning and memory, from the individual neuron to the system level.\n\n\n== Criticism ==\nA common criticism of neural networks, particularly in robotics, is that they require a large diversity of training for real-world operation. This is not surprising, since any learning machine needs sufficient representative examples in order to capture the underlying structure that allows it to generalize to new cases. Dean Pomerleau, in his research presented in the paper \"Knowledge-based Training of Artificial Neural Networks for Autonomous Robot Driving,\" uses a neural network to train a robotic vehicle to drive on multiple types of roads (single lane, multi-lane, dirt, etc.). A large amount of his research is devoted to (1) extrapolating multiple training scenarios from a single training experience, and (2) preserving past training diversity so that the system does not become overtrained (if, for example, it is presented with a series of right turns\u2014it should not learn to always turn right). These issues are common in neural networks that must decide from amongst a wide variety of responses, but can be dealt with in several ways, for example by randomly shuffling the training examples, by using a numerical optimization algorithm that does not take too large steps when changing the network connections following an example, or by grouping examples in so-called mini-batches.\nA. K. Dewdney, a former Scientific American columnist, wrote in 1997, \"Although neural nets do solve a few toy problems, their powers of computation are so limited that I am surprised anyone takes them seriously as a general problem-solving tool\" (Dewdney, p. 82).\nArguments for Dewdney's position are that to implement large and effective software neural networks, much processing and storage resources need to be committed. While the brain has hardware tailored to the task of processing signals through a graph of neurons, simulating even a most simplified form on Von Neumann technology may compel a neural network designer to fill many millions of database rows for its connections\u2014which can consume vast amounts of computer memory and hard disk space. Furthermore, the designer of neural network systems will often need to simulate the transmission of signals through many of these connections and their associated neurons\u2014which must often be matched with incredible amounts of CPU processing power and time. While neural networks often yield effective programs, they too often do so at the cost of efficiency (they tend to consume considerable amounts of time and money).\nArguments against Dewdney's position are that neural nets have been successfully used to solve many complex and diverse tasks, such as autonomously flying aircraft.Technology writer Roger Bridgman commented on Dewdney's statements about neural nets: \n\nNeural networks, for instance, are in the dock not only because they have been hyped to high heaven, (what hasn't?) but also because you could create a successful net without understanding how it worked: the bunch of numbers that captures its behaviour would in all probability be \"an opaque, unreadable table...valueless as a scientific resource\".\nIn spite of his emphatic declaration that science is not technology, Dewdney seems here to pillory neural nets as bad science when most of those devising them are just trying to be good engineers. An unreadable table that a useful machine could read would still be well worth having.\n\nAlthough it is true that analyzing what has been learned by an artificial neural network is difficult, it is much easier to do so than to analyze what has been learned by a biological neural network. Moreover, recent emphasis on the explainability of AI has contributed towards the development of methods, notably those based on attention mechanisms, for visualizing and explaining learned neural networks. Furthermore, researchers involved in exploring learning algorithms for neural networks are gradually uncovering generic principles which allow a learning machine to be successful. For example, Bengio and LeCun (2007) wrote an article regarding local vs non-local learning, as well as shallow vs deep architecture.Some other criticisms came from believers of hybrid models (combining neural networks and symbolic approaches). They advocate the intermix of these two approaches and believe that hybrid models can better capture the mechanisms of the human mind (Sun and Bookman, 1990).\n\n\n== Recent improvements ==\nWhile initially research had been concerned mostly with the electrical characteristics of neurons, a particularly important part of the investigation in recent years has been the exploration of the role of neuromodulators such as dopamine, acetylcholine, and serotonin on behaviour and learning.\nBiophysical models, such as BCM theory, have been important in understanding mechanisms for synaptic plasticity, and have had applications in both computer science and neuroscience. Research is ongoing in understanding the computational algorithms used in the brain, with some recent biological evidence for radial basis networks and neural backpropagation as mechanisms for processing data.\nComputational devices have been created in CMOS for both biophysical simulation and neuromorphic computing. More recent efforts show promise for creating nanodevices for very large scale principal components analyses and convolution. If successful, these efforts could usher in a new era of neural computing that is a step beyond digital computing, because it depends on learning rather than programming and because it is fundamentally analog rather than digital even though the first instantiations may in fact be with CMOS digital devices.\nBetween 2009 and 2012, the recurrent neural networks and deep feedforward neural networks developed in the research group of J\u00fcrgen Schmidhuber at the Swiss AI Lab IDSIA have won eight international competitions in pattern recognition and machine learning. For example, multi-dimensional long short term memory (LSTM) won three competitions in connected handwriting recognition at the 2009 International Conference on Document Analysis and Recognition (ICDAR), without any prior knowledge about the three different languages to be learned.\nVariants of the back-propagation algorithm as well as unsupervised methods by Geoff Hinton and colleagues at the University of Toronto can be used to train deep, highly nonlinear neural architectures, similar to the 1980 Neocognitron by Kunihiko Fukushima, and the \"standard architecture of vision\", inspired by the simple and complex cells identified by David H. Hubel and Torsten Wiesel in the primary visual cortex.\nRadial basis function and wavelet networks have also been introduced. These can be shown to offer best approximation properties and have been applied in nonlinear system identification and classification applications.Deep learning feedforward networks alternate convolutional layers and max-pooling layers, topped by several pure classification layers. Fast GPU-based implementations of this approach have won several pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition and the  ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge. Such neural networks also were the first artificial pattern recognizers to achieve human-competitive or even superhuman performance on benchmarks such as traffic sign recognition (IJCNN 2012), or the MNIST handwritten digits problem of Yann LeCun and colleagues at NYU.\n\n\n== See also ==\n\n\n== References ==\n\n\n== External links ==\n\nA Brief Introduction to Neural Networks (D. Kriesel) - Illustrated, bilingual manuscript about artificial neural networks; Topics so far: Perceptrons, Backpropagation, Radial Basis Functions, Recurrent Neural Networks, Self Organizing Maps, Hopfield Networks.\nReview of Neural Networks in Materials Science\nArtificial Neural Networks Tutorial in three languages (Univ. Polit\u00e9cnica de Madrid)\nAnother introduction to ANN\nNext Generation of Neural Networks - Google Tech Talks\nPerformance of Neural Networks\nNeural Networks and Information\nSanderson, Grant (October 5, 2017). \"But what is a Neural Network?\". 3Blue1Brown \u2013 via YouTube."}, {"job": "data scientist", "skill": "deep learning", "keywords": "deep learning", "description": "Deep learning  (also known as deep structured learning  or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised.Deep learning architectures such as deep neural networks, deep belief networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, audio recognition, social network filtering, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases superior to human experts.Artificial Neural Networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains.  Specifically, neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog.\n\n\n== Definition ==\n\nDeep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher level features from the raw input. For example, in image processing, lower layers may identify edges, while higher layers may identify the concepts relevant to a human such as digits or letters or faces.\n\n\n== Overview ==\nMost modern deep learning models are based on artificial neural networks, specifically, Convolutional Neural Networks (CNN)s, although they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.In deep learning, each level learns to transform its input data into a slightly more abstract and composite representation. In an image recognition application, the raw input may be a matrix of pixels; the first representational layer may abstract the pixels and encode edges; the second layer may compose and encode arrangements of edges; the third layer may encode a nose and eyes; and the fourth layer may recognize that the image contains a face. Importantly, a deep learning process can learn which features to optimally place in which level on its own. (Of course, this does not completely eliminate the need for hand-tuning; for example, varying numbers of layers and layer sizes can provide different degrees of abstraction.)The word \"deep\" in \"deep learning\" refers to the number of layers through which the data is transformed. More precisely, deep learning systems have a substantial credit assignment path (CAP) depth. The CAP is the chain of transformations from input to output. CAPs describe potentially causal connections between input and output. For a feedforward neural network, the depth of the CAPs is that of the network and is the number of hidden layers plus one (as the output layer is also parameterized). For recurrent neural networks, in which a signal may propagate through a layer more than once, the CAP depth is potentially unlimited. No universally agreed upon threshold of depth divides shallow learning from deep learning, but most researchers agree that deep learning involves CAP depth higher than 2. CAP of depth 2 has been shown to be a universal approximator in the sense that it can emulate any function. Beyond that, more layers do not add to the function approximator ability of the network. Deep models (CAP > 2) are able to extract better features than shallow models and hence, extra layers help in learning the features effectively.\nDeep learning architectures can be constructed with a greedy layer-by-layer method. Deep learning helps to disentangle these abstractions and pick out which features improve performance.For supervised learning tasks, deep learning methods eliminate feature engineering, by translating the data into compact intermediate representations akin to principal components, and derive layered structures that remove redundancy in representation.\nDeep learning algorithms can be applied to unsupervised learning tasks. This is an important benefit because unlabeled data are more abundant than the labeled data. Examples of deep structures that can be trained in an unsupervised manner are neural history compressors and deep belief networks.\n\n\n== Interpretations ==\nDeep neural networks are generally interpreted in terms of the universal approximation theorem or probabilistic inference.The classic universal approximation theorem concerns the capacity of feedforward neural networks with a single hidden layer of finite size to approximate continuous functions. In 1989, the first proof was published by George Cybenko for sigmoid activation functions and was generalised to feed-forward multi-layer architectures in 1991 by Kurt Hornik. Recent work also showed that universal approximation also holds for non-bounded activation functions such as the rectified linear unit. The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width but the depth is allowed to grow. Lu et al. proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue integrable function; If the width is smaller or equal to the input dimension, then deep neural network is not a universal approximator.\nThe probabilistic interpretation derives from the field of machine learning. It features inference, as well as the optimization concepts of training and testing, related to fitting and generalization, respectively. More specifically, the probabilistic interpretation considers the activation nonlinearity as a cumulative distribution function. The probabilistic interpretation led to the introduction of dropout as regularizer in neural networks. The probabilistic interpretation was introduced by researchers including Hopfield, Widrow and Narendra and popularized in surveys such as the one by Bishop.\n\n\n== History ==\nThe term Deep Learning was introduced to the machine learning community by Rina Dechter in 1986, and to artificial neural networks by Igor Aizenberg and colleagues in 2000, in the context of Boolean threshold neurons.The first general, working learning algorithm for supervised, deep, feedforward, multilayer perceptrons was published by Alexey Ivakhnenko and Lapa in 1965. A 1971 paper described already a deep network with 8 layers trained by the group method of data handling algorithm.Other deep learning working architectures, specifically those built for computer vision, began with the Neocognitron introduced by Kunihiko Fukushima in 1980. In 1989, Yann LeCun et al. applied the standard backpropagation algorithm, which had been around as the reverse mode of automatic differentiation since 1970, to a deep neural network with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.By 1991 such systems were used for recognizing isolated 2-D hand-written digits, while recognizing 3-D objects was done by matching 2-D images with a handcrafted 3-D object model. Weng et al. suggested that a human brain does not use a monolithic 3-D object model and in 1992 they published Cresceptron, a method for performing 3-D object recognition in cluttered scenes. Because it directly used natural images, Cresceptron started the beginning of general-purpose visual learning for natural 3D worlds. Cresceptron is a cascade of layers similar to Neocognitron. But while Neocognitron required a human programmer to hand-merge features, Cresceptron learned an open number of features in each layer without supervision, where each feature is represented by a convolution kernel. Cresceptron segmented each learned object from a cluttered scene through back-analysis through the network. Max pooling, now often adopted by deep neural networks (e.g. ImageNet tests), was first used in Cresceptron to reduce the position resolution by a factor of (2x2) to 1 through the cascade for better generalization.\nIn 1994, Andr\u00e9 de Carvalho, together with Mike Fairhurst and David Bisset, published experimental results of a multi-layer boolean neural network, also known as a weightless neural network, composed of a 3-layers self-organising feature extraction neural network module (SOFT) followed by a multi-layer classification neural network module (GSN), which were independently trained. Each layer in the feature extraction module extracted features with growing complexity regarding the previous layer.In 1995, Brendan Frey demonstrated that it was possible to train (over two days) a network containing six fully connected layers and several hundred hidden units using the wake-sleep algorithm, co-developed with Peter Dayan and Hinton. Many factors contribute to the slow speed, including the vanishing gradient problem analyzed in 1991 by Sepp Hochreiter.Simpler models that use task-specific handcrafted features such as Gabor filters and support vector machines (SVMs) were a popular choice in the 1990s and 2000s, because of artificial neural network's (ANN) computational cost and a lack of understanding of how the brain wires its biological networks.\nBoth shallow and deep learning (e.g., recurrent nets) of ANNs have been explored for many years. These methods never outperformed non-uniform internal-handcrafting Gaussian mixture model/Hidden Markov model (GMM-HMM) technology based on generative models of speech trained discriminatively. Key difficulties have been analyzed, including gradient diminishing and weak temporal correlation structure in neural predictive models. Additional difficulties were the lack of training data and limited computing power.\nMost speech recognition researchers moved away from neural nets to pursue generative modeling. An exception was at SRI International in the late 1990s. Funded by the US government's NSA and DARPA, SRI studied deep neural networks in speech and speaker recognition. The speaker recognition team led by Larry Heck achieved the first significant success with deep neural networks in speech processing in the 1998 National Institute of Standards and Technology Speaker Recognition evaluation. While SRI experienced success with deep neural networks in speaker recognition, they were unsuccessful in demonstrating similar success in speech recognition.\nThe principle of elevating \"raw\" features over hand-crafted optimization was first explored successfully in the architecture of deep autoencoder on the \"raw\" spectrogram or linear filter-bank features in the late 1990s, showing its superiority over the Mel-Cepstral features that contain stages of fixed transformation from spectrograms. The raw features of speech, waveforms, later produced excellent larger-scale results.Many aspects of speech recognition were taken over by a deep learning method called long short-term memory (LSTM), a recurrent neural network published by Hochreiter and Schmidhuber in 1997. LSTM RNNs avoid the vanishing gradient problem and can learn \"Very Deep Learning\" tasks that require memories of events that happened thousands of discrete time steps before, which is important for speech. In 2003, LSTM started to become competitive with traditional speech recognizers on certain tasks. Later it was combined with connectionist temporal classification (CTC) in stacks of LSTM RNNs. In 2015, Google's speech recognition reportedly experienced a dramatic performance jump of 49% through CTC-trained LSTM, which they made available through Google Voice Search.In 2006, publications by Geoff Hinton, Ruslan Salakhutdinov, Osindero and Teh showed how a many-layered feedforward neural network could be effectively pre-trained one layer at a time, treating each layer in turn as an unsupervised restricted Boltzmann machine, then fine-tuning it using supervised backpropagation. The papers referred to learning for deep belief nets.\nDeep learning is part of state-of-the-art systems in various disciplines, particularly computer vision and automatic speech recognition (ASR). Results on commonly used evaluation sets such as TIMIT (ASR) and MNIST (image classification), as well as a range of large-vocabulary speech recognition tasks have steadily improved. Convolutional neural networks (CNNs) were superseded for ASR by CTC for LSTM. but are more successful in computer vision.\nThe impact of deep learning in industry began in the early 2000s, when CNNs already processed an estimated 10% to 20% of all the checks written in the US, according to Yann LeCun. Industrial applications of deep learning to large-scale speech recognition started around 2010.\nThe 2009 NIPS Workshop on Deep Learning for Speech Recognition was motivated by the limitations of deep generative models of speech, and the possibility that given more capable hardware and large-scale data sets that deep neural nets (DNN) might become practical. It was believed that pre-training DNNs using generative models of deep belief nets (DBN) would overcome the main difficulties of neural nets. However, it was discovered that replacing pre-training with large amounts of training data for straightforward backpropagation when using DNNs with large, context-dependent output layers produced error rates dramatically lower than then-state-of-the-art Gaussian mixture model (GMM)/Hidden Markov Model (HMM) and also than more-advanced generative model-based systems. The nature of the recognition errors produced by the two types of systems was characteristically different, offering technical insights into how to integrate deep learning into the existing highly efficient, run-time speech decoding system deployed by all major speech recognition systems. Analysis around 2009-2010, contrasted the GMM (and other generative speech models) vs. DNN models, stimulated early industrial investment in deep learning for speech recognition, eventually leading to pervasive and dominant use in that industry. That analysis was done with comparable performance (less than 1.5% in error rate) between discriminative DNNs and generative models.In 2010, researchers extended deep learning from TIMIT to large vocabulary speech recognition, by adopting large output layers of the DNN based on context-dependent HMM states constructed by decision trees.Advances in hardware have enabled renewed interest in deep learning. In 2009, Nvidia was involved in what was called the \u201cbig bang\u201d of deep learning, \u201cas deep-learning neural networks were trained with Nvidia graphics processing units (GPUs).\u201d That year, Google Brain used Nvidia GPUs to create capable DNNs. While there, Andrew Ng determined that GPUs could increase the speed of deep-learning systems by about 100 times. In particular, GPUs are well-suited for the matrix/vector computations involved in machine learning. GPUs speed up training algorithms by orders of magnitude, reducing running times from weeks to days. Further, specialized hardware and algorithm optimizations can be used for efficient processing of deep learning models.\n\n\n=== Deep learning revolution ===\n\nIn 2012, a team led by George E. Dahl won the \"Merck Molecular Activity Challenge\" using multi-task deep neural networks to predict the biomolecular target of one drug. In 2014, Hochreiter's group used deep learning to detect off-target and toxic effects of environmental chemicals in nutrients, household products and drugs and won the \"Tox21 Data Challenge\" of NIH, FDA and NCATS.Significant additional impacts in image or object recognition were felt from 2011 to 2012. Although CNNs trained by backpropagation had been around for decades, and GPU implementations of NNs for years, including CNNs, fast implementations of CNNs with max-pooling on GPUs in the style of Ciresan and colleagues were needed to progress on computer vision. In 2011, this approach achieved for the first time superhuman performance in a visual pattern recognition contest. Also in 2011, it won the ICDAR Chinese handwriting contest, and in May 2012, it won the ISBI image segmentation contest. Until 2011, CNNs did not play a major role at computer vision conferences, but in June 2012, a paper by Ciresan et al. at the leading conference CVPR showed how max-pooling CNNs on GPU can dramatically improve many vision benchmark records. In October 2012, a similar system by Krizhevsky et al. won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. In November 2012, Ciresan et al.'s system also won the ICPR contest on analysis of large medical images for cancer detection, and in the following year also the MICCAI Grand Challenge on the same topic. In 2013 and 2014, the error rate on the ImageNet task using deep learning was further reduced, following a similar trend in large-scale speech recognition. The Wolfram Image Identification project publicized these improvements.Image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.Some researchers assess that the October 2012 ImageNet victory anchored the start of a \"deep learning revolution\" that has transformed the AI industry.In March 2019, Yoshua Bengio, Geoffrey Hinton and Yann LeCun were awarded the Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing.\n\n\n== Neural networks ==\n\n\n=== Artificial neural networks ===\n\nArtificial neural networks (ANNs) or connectionist systems are computing systems inspired by the biological neural networks that constitute animal brains. Such systems learn (progressively improve their ability) to do tasks by considering examples, generally without task-specific programming. For example, in image recognition, they might learn to identify images that contain cats by analyzing example images that have been manually labeled as \"cat\" or \"no cat\" and using the analytic results to identify cats in other images. They have found most use in applications difficult to express with a traditional computer algorithm using rule-based programming.\nAn ANN is based on a collection of connected units called artificial neurons, (analogous to biological neurons in a biological brain). Each connection (synapse) between neurons can transmit a signal to another neuron. The receiving (postsynaptic) neuron can process the signal(s) and then signal downstream neurons connected to it. Neurons may have state, generally represented by real numbers, typically between 0 and 1. Neurons and synapses may also have a weight that varies as learning proceeds, which can increase or decrease the strength of the signal that it sends downstream.\nTypically, neurons are organized in layers. Different layers may perform different kinds of transformations on their inputs. Signals travel from the first (input), to the last (output) layer, possibly after traversing the layers multiple times.\nThe original goal of the neural network approach was to solve problems in the same way that a human brain would. Over time, attention focused on matching specific mental abilities, leading to deviations from biology such as backpropagation, or passing information in the reverse direction and adjusting the network to reflect that information.\nNeural networks have been used on a variety of tasks, including computer vision, speech recognition, machine translation, social network filtering, playing board and video games and medical diagnosis.\nAs of 2017, neural networks typically have a few thousand to a few million units and millions of connections. Despite this number being several order of magnitude less than the number of neurons on a human brain, these networks can perform many tasks at a level beyond that of humans (e.g., recognizing faces, playing \"Go\" ).\n\n\n=== Deep neural networks ===\nA deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. The DNN finds the correct mathematical manipulation to turn the input into the output, whether it be a linear relationship or a non-linear relationship. The network moves through the layers calculating the probability of each output. For example, a DNN that is trained to recognize dog breeds will go over the given image and calculate the probability that the dog in the image is a certain breed. The user can review the results and select which probabilities the network should display (above a certain threshold, etc.) and return the proposed label. Each mathematical manipulation as such is considered a layer, and complex DNN have many layers, hence the name \"deep\" networks.\nDNNs can model complex non-linear relationships. DNN architectures generate compositional models where the object is expressed as a layered composition of primitives. The extra layers enable composition of features from lower layers, potentially modeling complex data with fewer units than a similarly performing shallow network.Deep architectures include many variants of a few basic approaches. Each architecture has found success in specific domains. It is not always possible to compare the performance of multiple architectures, unless they have been evaluated on the same data sets.\nDNNs are typically feedforward networks in which data flows from the input layer to the output layer without looping back. At first, the DNN creates a map of virtual neurons and assigns random numerical values, or \"weights\", to connections between them. The weights and inputs are multiplied and return an output between 0 and 1. If the network did not accurately recognize a particular pattern, an algorithm would adjust the weights. That way the algorithm can make certain parameters more influential, until it determines the correct mathematical manipulation to fully process the data.\nRecurrent neural networks (RNNs), in which data can flow in any direction, are used for applications such as language modeling. Long short-term memory is particularly effective for this use.Convolutional deep neural networks (CNNs) are used in computer vision. CNNs also have been applied to acoustic modeling for automatic speech recognition (ASR).\n\n\n==== Challenges ====\nAs with ANNs, many issues can arise with naively trained DNNs. Two common issues are overfitting and computation time.\nDNNs are prone to overfitting because of the added layers of abstraction, which allow them to model rare dependencies in the training data. Regularization methods such as Ivakhnenko's unit pruning or weight decay (\n  \n    \n      \n        \n          \u2113\n          \n            2\n          \n        \n      \n    \n    {\\displaystyle \\ell _{2}}\n  -regularization) or sparsity (\n  \n    \n      \n        \n          \u2113\n          \n            1\n          \n        \n      \n    \n    {\\displaystyle \\ell _{1}}\n  -regularization) can be applied during training to combat overfitting. Alternatively dropout regularization randomly omits units from the hidden layers during training. This helps to exclude rare dependencies. Finally, data can be augmented via methods such as cropping and rotating such that smaller training sets can be increased in size to reduce the chances of overfitting.DNNs must consider many training parameters, such as the size (number of layers and number of units per layer), the learning rate, and initial weights. Sweeping through the parameter space for optimal parameters may not be feasible due to the cost in time and computational resources. Various tricks, such as batching (computing the gradient on several training examples at once rather than individual examples) speed up computation. Large processing capabilities of many-core architectures (such as GPUs or the Intel Xeon Phi) have produced significant speedups in training, because of the suitability of such processing architectures for the matrix and vector computations.Alternatively, engineers may look for other types of neural networks with more straightforward and convergent training algorithms. CMAC (cerebellar model articulation controller) is one such kind of neural network. It doesn't require learning rates or randomized initial weights for CMAC. The training process can be guaranteed to converge in one step with a new batch of data, and the computational complexity of the training algorithm is linear with respect to the number of neurons involved.\n\n\n== Applications ==\n\n\n=== Automatic speech recognition ===\n\nLarge-scale automatic speech recognition is the first and most convincing successful case of deep learning. LSTM RNNs can learn \"Very Deep Learning\" tasks that involve multi-second intervals containing speech events separated by thousands of discrete time steps, where one time step corresponds to about 10 ms. LSTM with forget gates is competitive with traditional speech recognizers on certain tasks.The initial success in speech recognition was based on small-scale recognition tasks based on TIMIT. The data set contains 630 speakers from eight major dialects of American English, where each speaker reads 10 sentences. Its small size lets many configurations be tried. More importantly, the TIMIT task concerns phone-sequence recognition, which, unlike word-sequence recognition, allows weak phone bigram language models. This lets the strength of the acoustic modeling aspects of speech recognition be more easily analyzed. The error rates listed below, including these early results and measured as percent phone error rates (PER), have been summarized since 1991.\n\nThe debut of DNNs for speaker recognition in the late 1990s and speech recognition around 2009-2011 and of LSTM around 2003-2007, accelerated progress in eight major areas:\nScale-up/out and accelerated DNN training and decoding\nSequence discriminative training\nFeature processing by deep models with solid understanding of the underlying mechanisms\nAdaptation of DNNs and related deep models\nMulti-task and transfer learning by DNNs and related deep models\nCNNs and how to design them to best exploit domain knowledge of speech\nRNN and its rich LSTM variants\nOther types of deep models including tensor-based models and integrated deep generative/discriminative models.All major commercial speech recognition systems (e.g., Microsoft Cortana, Xbox, Skype Translator, Amazon Alexa, Google Now, Apple Siri, Baidu and iFlyTek voice search, and a range of Nuance speech products, etc.) are based on deep learning.\n\n\n=== Image recognition ===\n\nA common evaluation set for image classification is the MNIST database data set. MNIST is composed of handwritten digits and includes 60,000 training examples and 10,000 test examples. As with TIMIT, its small size lets users test multiple configurations. A comprehensive list of results on this set is available.Deep learning-based image recognition has become \"superhuman\", producing more accurate results than human contestants. This first occurred in 2011.Deep learning-trained vehicles now interpret 360\u00b0 camera views. Another example is Facial Dysmorphology Novel Analysis (FDNA) used to analyze cases of human malformation connected to a large database of genetic syndromes.\n\n\n=== Visual art processing ===\nClosely related to the progress that has been made in image recognition is the increasing application of deep learning techniques to various visual art tasks. DNNs have proven themselves capable, for example, of a) identifying the style period of a given painting, b) Neural Style Transfer - capturing the style of a given artwork and applying it in a visually pleasing manner to an arbitrary photograph or video, and c) generating striking imagery based on random visual input fields.\n\n\n=== Natural language processing ===\n\nNeural networks have been used for implementing language models since the early 2000s. LSTM helped to improve machine translation and language modeling.Other key techniques in this field are negative sampling and word embedding. Word embedding, such as word2vec, can be thought of as a representational layer in a deep learning architecture that transforms an atomic word into a positional representation of the word relative to other words in the dataset; the position is represented as a point in a vector space. Using word embedding as an RNN input layer allows the network to parse sentences and phrases using an effective compositional vector grammar. A compositional vector grammar can be thought of as probabilistic context free grammar (PCFG) implemented by an RNN. Recursive auto-encoders built atop word embeddings can assess sentence similarity and detect paraphrasing. Deep neural architectures provide the best results for constituency parsing, sentiment analysis, information retrieval, spoken language understanding, machine translation, contextual entity linking, writing style recognition, Text classification and others.Recent developments generalize word embedding to sentence embedding.\nGoogle Translate (GT) uses a large end-to-end long short-term memory network. Google Neural Machine Translation (GNMT) uses an example-based machine translation method in which the system \"learns from millions of examples.\" It translates \"whole sentences at a time, rather than pieces. Google Translate supports over one hundred languages. The network encodes the \"semantics of the sentence rather than simply memorizing phrase-to-phrase translations\". GT uses English as an intermediate between most language pairs.\n\n\n=== Drug discovery and toxicology ===\n\nA large percentage of candidate drugs fail to win regulatory approval. These failures are caused by insufficient efficacy (on-target effect), undesired interactions (off-target effects), or unanticipated toxic effects. Research has explored use of deep learning to predict the biomolecular targets, off-targets, and toxic effects of environmental chemicals in nutrients, household products and drugs.AtomNet is a deep learning system for structure-based rational drug design. AtomNet was used to predict novel candidate biomolecules for disease targets such as the Ebola virus and multiple sclerosis.In 2019 generative neural networks were used to produce molecules that were validated experimentally all the way into mice , .\n\n\n=== Customer relationship management ===\n\nDeep reinforcement learning has been used to approximate the value of possible direct marketing actions, defined in terms of RFM variables. The estimated value function was shown to have a natural interpretation as customer lifetime value.\n\n\n=== Recommendation systems ===\n\nRecommendation systems have used deep learning to extract meaningful features for a latent factor model for content-based music recommendations.  Multiview deep learning has been applied for learning user preferences from multiple domains. The model uses a hybrid collaborative and content-based approach and enhances recommendations in multiple tasks.\n\n\n=== Bioinformatics ===\n\nAn autoencoder ANN was used in bioinformatics, to predict gene ontology annotations and gene-function relationships.In medical informatics, deep learning was used to predict sleep quality based on data from wearables and predictions of health complications from electronic health record data. Deep learning has also showed efficacy in healthcare.\n\n\n=== Medical Image Analysis ===\nDeep learning has been shown to produce competitive results in medical application such as cancer cell classification, lesion detection, organ segmentation and image enhancement\n\n\n=== Mobile advertising ===\nFinding the appropriate mobile audience for mobile advertising is always challenging, since many data points must be considered and assimilated before a target segment can be created and used in ad serving by any ad server. Deep learning has been used to interpret large, many-dimensioned advertising datasets. Many data points are collected during the request/serve/click internet advertising cycle. This information can form the basis of machine learning to improve ad selection.\n\n\n=== Image restoration ===\nDeep learning has been successfully applied to inverse problems such as denoising, super-resolution, inpainting, and film colorization. These applications include learning methods such as \"Shrinkage Fields for Effective Image Restoration\" which trains on an image dataset, and Deep Image Prior, which trains on the image that needs restoration.\n\n\n=== Financial fraud detection ===\nDeep learning is being successfully applied to financial fraud detection and anti-money laundering. \"Deep anti-money laundering detection system can spot and recognize relationships and similarities between data and, further down the road, learn to detect anomalies or classify and predict specific events\". The solution leverages both supervised learning techniques, such as the classification of suspicious transactions, and unsupervised learning, e.g. anomaly detection.\n\n\n=== Military ===\nThe United States Department of Defense applied deep learning to train robots in new tasks through observation.\n\n\n== Relation to human cognitive and brain development ==\nDeep learning is closely related to a class of theories of brain development (specifically, neocortical development) proposed by cognitive neuroscientists in the early 1990s. These developmental theories were instantiated in computational models, making them predecessors of deep learning systems. These developmental models share the property that various proposed learning dynamics in the brain (e.g., a wave of nerve growth factor) support the self-organization somewhat analogous to the neural networks utilized in deep learning models. Like the neocortex, neural networks employ a hierarchy of layered filters in which each layer considers information from a prior layer (or the operating environment), and then passes its output (and possibly the original input), to other layers. This process yields a self-organizing stack of transducers, well-tuned to their operating environment. A 1995 description stated, \"...the infant's brain seems to organize itself under the influence of waves of so-called trophic-factors ... different regions of the brain become connected sequentially, with one layer of tissue maturing before another and so on until the whole brain is mature.\"A variety of approaches have been used to investigate the plausibility of deep learning models from a neurobiological perspective. On the one hand, several variants of the backpropagation algorithm have been proposed in order to increase its processing realism. Other researchers have argued that unsupervised forms of deep learning, such as those based on hierarchical generative models and deep belief networks, may be closer to biological reality. In this respect, generative neural network models have been related to neurobiological evidence about sampling-based processing in the cerebral cortex.Although a systematic comparison between the human brain organization and the neuronal encoding in deep networks has not yet been established, several analogies have been reported. For example, the computations performed by deep learning units could be similar to those of actual neurons and neural populations. Similarly, the representations developed by deep learning models are similar to those measured in the primate visual system both at the single-unit and at the population levels.\n\n\n== Commercial activity ==\nFacebook's AI lab performs tasks such as automatically tagging uploaded pictures with the names of the people in them.Google's DeepMind Technologies developed a system capable of learning how to play Atari video games using only pixels as data input. In 2015 they demonstrated their AlphaGo system, which learned the game of Go well enough to beat a professional Go player. Google Translate uses an LSTM to translate between more than 100 languages.\nIn 2015, Blippar demonstrated a mobile augmented reality application that uses deep learning to recognize objects in real time.In 2017, Covariant.ai was launched, which focuses on integrating deep learning into factories.As of 2008, researchers at The University of Texas at Austin (UT) developed a machine learning framework called Training an Agent Manually via Evaluative Reinforcement, or TAMER, which proposed new methods for robots or computer programs to learn how to perform tasks by interacting with a human instructor. First developed as TAMER, a new algorithm called Deep TAMER was later introduced in 2018 during a collaboration between U.S. Army Research Laboratory (ARL) and UT researchers. Deep TAMER used deep learning to provide a robot the ability to learn new tasks through observation. Using Deep TAMER, a robot learned a task with a human trainer, watching video streams or observing a human perform a task in-person. The robot later practiced the task with the help of some coaching from the trainer, who provided feedback such as \u201cgood job\u201d and \u201cbad job.\u201d\n\n\n== Criticism and comment ==\nDeep learning has attracted both criticism and comment, in some cases from outside the field of computer science.\n\n\n=== Theory ===\n\nA main criticism concerns the lack of theory surrounding some methods. Learning in the most common deep architectures is implemented using well-understood gradient descent. However, the theory surrounding other algorithms, such as contrastive divergence is less clear. (e.g., Does it converge? If so, how fast? What is it approximating?) Deep learning methods are often looked at as a black box, with most confirmations done empirically, rather than theoretically.\nOthers point out that deep learning should be looked at as a step towards realizing strong AI, not as an all-encompassing solution. Despite the power of deep learning methods, they still lack much of the functionality needed for realizing this goal entirely. Research psychologist Gary Marcus noted:\"Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (...) have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge, such as information about what objects are, what they are for, and how they are typically used. The most powerful A.I. systems, like Watson (...) use techniques like deep learning as just one element in a very complicated ensemble of techniques, ranging from the statistical technique of Bayesian inference to deductive reasoning.\"As an alternative to this emphasis on the limits of deep learning, one author speculated that it might be possible to train a machine vision stack to perform the sophisticated task of discriminating between \"old master\" and amateur figure drawings, and hypothesized that such a sensitivity might represent the rudiments of a non-trivial machine empathy. This same author proposed that this would be in line with anthropology, which identifies a concern with aesthetics as a key element of behavioral modernity.In further reference to the idea that artistic sensitivity might inhere within relatively low levels of the cognitive hierarchy, a published series of graphic representations of the internal states of deep (20-30 layers) neural networks attempting to discern within essentially random data the images on which they were trained demonstrate a visual appeal: the original research notice received well over 1,000 comments, and was the subject of what was for a time the most frequently accessed article on The Guardian's web site.\n\n\n=== Errors ===\nSome deep learning architectures display problematic behaviors, such as confidently classifying unrecognizable images as belonging to a familiar category of ordinary images and misclassifying minuscule perturbations of correctly classified images. Goertzel hypothesized that these behaviors are due to limitations in their internal representations and that these limitations would inhibit integration into heterogeneous multi-component artificial general intelligence (AGI) architectures. These issues may possibly be addressed by deep learning architectures that internally form states homologous to image-grammar decompositions of observed entities and events. Learning a grammar (visual or linguistic) from training data would be equivalent to restricting the system to commonsense reasoning that operates on concepts in terms of grammatical production rules and is a basic goal of both human language acquisition and artificial intelligence (AI).\n\n\n=== Cyber threat ===\nAs deep learning moves from the lab into the world, research and experience shows that artificial neural networks are vulnerable to hacks and deception. By identifying patterns that these systems use to function, attackers can modify inputs to ANNs in such a way that the ANN finds a match that human observers would not recognize. For example, an attacker can make subtle changes to an image such that the ANN finds a match even though the image looks to a human nothing like the search target. Such a manipulation is termed an \u201cadversarial attack.\u201d In 2016 researchers used one ANN to doctor images in trial and error fashion, identify another's focal points and thereby generate images that deceived it. The modified images looked no different to human eyes. Another group showed that printouts of doctored images then photographed successfully tricked an image classification system. One defense is reverse image search, in which a possible fake image is submitted to a site such as TinEye that can then find other instances of it. A refinement is to search using only parts of the image, to identify images from which that piece may have been taken.Another group showed that certain psychedelic spectacles could fool a facial recognition system into thinking ordinary people were celebrities, potentially allowing one person to impersonate another. In 2017 researchers added stickers to stop signs and caused an ANN to misclassify them.ANNs can however be further trained to detect attempts at deception, potentially leading attackers and defenders into an arms race similar to the kind that already defines the malware defense industry. ANNs have been trained to defeat ANN-based anti-malware software by repeatedly attacking a defense with malware that was continually altered by a genetic algorithm until it tricked the anti-malware while retaining its ability to damage the target.Another group demonstrated that certain sounds could make the Google Now voice command system open a particular web address that would download malware.In \u201cdata poisoning,\u201d false data is continually smuggled into a machine learning system's training set to prevent it from achieving mastery.\n\n\n=== Reliance on human microwork ===\nMost Deep Learning systems rely on training and verification data that is generated and/or annotated by humans. It has been argued in media philosophy that not only low-payed clickwork (e.g. on Amazon Mechanical Turk) is regularly deployed for this purpose, but also implicit forms of human microwork that are often not recognized as such. The philosopher Rainer M\u00fchlhoff distinguishes five types of \"machinic capture\" of human microwork to generate training data: (1) gamification (the embedding of annotation or computation tasks in the flow of a game), (2) \"trapping and tracking\" (e.g. CAPTCHAs for image recognition or click-tracking on Google search results pages), (3) exploitation of social motivations (e.g. tagging faces on Facebook to obtain labeled facial images), (4) information mining (e.g. by leveraging quantified-self devices such as activity trackers) and (5) clickwork. M\u00fchlhoff argues that in most commercial end-user applications of Deep Learning such as Facebook's face recognition system, the need for training data does not stop once an ANN is trained. Rather, there is a continued demand for human-generated verification data to constantly calibrate and update the ANN. For this purpose Facebook introduced the feature that once a user is automatically recognized in an image, they receive a notification. They can choose whether of not they like to be publicly labeled on the image, or tell Facebook that it is not them in the picture. This user interface is a mechanism to generate \"a constant stream of  verification data\" to further train the network in real-time. As M\u00fchlhoff argues, involvement of human users to generate training and verification data is so typical for most commercial end-user applications of Deep Learning that such systems may be referred to as \"human-aided artificial intelligence\".  \n\n\n== See also ==\nApplications of artificial intelligence\nComparison of deep learning software\nCompressed sensing\nEcho state network\nList of artificial intelligence projects\nLiquid state machine\nList of datasets for machine learning research\nReservoir computing\nSparse coding\n\n\n== References ==\n\n\n== Further reading =="}, {"job": "data scientist", "skill": "text mining", "keywords": "text mining", "description": "According to Hotho et al. (2005) we can differ three different perspectives of text mining, namely text mining as information extraction, text mining as text data mining, and text mining as KDD (Knowledge Discovery in Databases) process. Text mining is \"the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.\" Written resources can be websites, books, emails, reviews, articles.\nText mining, also referred to as text data mining, roughly equivalent to text analytics, is the process of deriving high-quality information from text. High-quality information is typically derived through the devising of patterns and trends through means such as statistical pattern learning. Text mining usually involves the process of structuring the input text (usually parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database), deriving patterns within the structured data, and finally evaluation and interpretation of the output. 'High quality' in text mining usually refers to some combination of relevance, novelty, and interest. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).\nText analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis, via application of natural language processing (NLP), different types of algorithms and analytical methods. An important phase of this process is the interpretation of the gathered information.\nA typical application is to scan a set of documents written in a natural language and either model the document set for predictive classification purposes or populate a database or search index with the information extracted.\nThe document is the basic element while starting with text mining. Here, we define a document as a unit of textual data, which normally exists in many types of collections.\n\n\n== Text analytics ==\nThe term text analytics describes a set of linguistic, statistical, and machine learning techniques that model and structure the information content of textual sources for business intelligence, exploratory data analysis, research, or investigation. The term is roughly synonymous with text mining; indeed, Ronen Feldman modified a 2000 description of \"text mining\" in 2004 to describe \"text analytics\". The latter term is now used more frequently in business settings while \"text mining\" is used in some of the earliest application areas, dating to the 1980s, notably life-sciences research and government intelligence.\nThe term text analytics also describes that application of text analytics to respond to business problems, whether independently or in conjunction with query and analysis of fielded, numerical data. It is a truism that 80 percent of business-relevant information originates in unstructured form, primarily text. These techniques and processes discover and present knowledge \u2013 facts, business rules, and relationships \u2013 that is otherwise locked in textual form, impenetrable to automated processing.\n\n\n== Text analysis processes ==\nSubtasks\u2014components of a larger text-analytics effort\u2014typically include:\n\nDimensionality reduction is important technique for pre-processing data. Technique is used to identify the root word for actual words and reduce the size of the text data.\nInformation retrieval or identification of a corpus is a preparatory step: collecting or identifying a set of textual materials, on the Web or held in a file system, database, or content corpus manager, for analysis.\nAlthough some text analytics systems apply exclusively advanced statistical methods, many others apply more extensive natural language processing, such as part of speech tagging, syntactic parsing, and other types of linguistic analysis.\nNamed entity recognition is the use of gazetteers or statistical techniques to identify named text features: people, organizations, place names, stock ticker symbols, certain abbreviations, and so on.\nDisambiguation\u2014the use of contextual clues\u2014may be required to decide where, for instance, \"Ford\" can refer to a former U.S. president, a vehicle manufacturer, a movie star, a river crossing, or some other entity.\nRecognition of Pattern Identified Entities: Features such as telephone numbers, e-mail addresses, quantities (with units) can be discerned via regular expression or other pattern matches.\nDocument clustering: identification of sets of similar text documents.\nCoreference: identification of noun phrases and other terms that refer to the same object.\nRelationship, fact, and event Extraction: identification of associations among entities and other information in text\nSentiment analysis involves discerning subjective (as opposed to factual) material and extracting various forms of attitudinal information: sentiment, opinion, mood, and emotion. Text analytics techniques are helpful in analyzing sentiment at the entity, concept, or topic level and in distinguishing opinion holder and opinion object.\nQuantitative text analysis is a set of techniques stemming from the social sciences where either a human judge or a computer extracts semantic or grammatical relationships between words in order to find out the meaning or stylistic patterns of, usually, a casual personal text for the purpose of psychological profiling etc.\n\n\n== Applications ==\nText mining technology is now broadly applied to a wide variety of government, research, and business needs. All these groups may use text mining for records management and searching documents relevant to their daily activities. Legal professionals may use text mining for e-discovery, for example. Governments and military groups use text mining for national security and intelligence purposes. Scientific researchers incorporate text mining approaches into efforts to organize large sets of text data (i.e., addressing the problem of unstructured data), to determine ideas communicated through text (e.g., sentiment analysis in social media) and to support scientific discovery in fields such as the life sciences and bioinformatics. In business, applications are used to support competitive intelligence and automated ad placement, among numerous other activities.\n\n\n=== Security applications ===\nMany text mining software packages are marketed for security applications, especially monitoring and analysis of online plain text sources such as Internet news, blogs, etc. for national security purposes. It is also involved in the study of text encryption/decryption.\n\n\n=== Biomedical applications ===\n\nA range of text mining applications in the biomedical literature has been described, including computational approaches to assist with studies in protein docking, protein interactions, and protein-disease associations. In addition, with large patient textual datasets in the clinical field, datasets of demographic information in population studies and adverse event reports, text mining can facilitate clinical studies and precision medicine. Text mining algorithms can facilitate the stratification and indexing of specific clinical events in large patient textual datasets of symptoms, side effects, and comorbidities from electronic health records, event reports, and reports from specific diagnostic tests. One online text mining application in the biomedical literature is PubGene, a publicly accessible search engine that combines biomedical text mining with network visualization. GoPubMed is a knowledge-based search engine for biomedical texts. Text mining techniques also enable us to extract unknown knowledge from unstructured documents in the clinical domain\n\n\n=== Software applications ===\nText mining methods and software is also being researched and developed by major firms, including IBM and Microsoft, to further automate the mining and analysis processes, and by different firms working in the area of search and indexing in general as a way to improve their results. Within public sector much effort has been concentrated on creating software for tracking and monitoring terrorist activities. For study purposes, Weka software is one of the most popular options in the scientific world, acting as an excellent entry point for beginners. For Python programmers, there is an excellent toolkit called NLTK for more general purposes. For more advanced programmers, there's also the Gensim library, which focuses on word embedding-based text representations.\n\n\n=== Online media applications ===\nText mining is being used by large media companies, such as the Tribune Company, to clarify information and to provide readers with greater search experiences, which in turn increases site \"stickiness\" and revenue. Additionally, on the back end, editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content.\n\n\n=== Business and marketing applications ===\nText mining is starting to be used in marketing as well, more specifically in analytical customer relationship management. Coussement and Van den Poel (2008) apply it to improve predictive analytics models for customer churn (customer attrition). Text mining is also being applied in stock returns prediction.\n\n\n=== Sentiment analysis ===\nSentiment analysis may involve analysis of movie reviews for estimating how favorable a review is for a movie.\nSuch an analysis may need a labeled data set or labeling of the affectivity of words.\nResources for affectivity of words and concepts have been made for WordNet and ConceptNet, respectively.\nText has been used to detect emotions in the related area of affective computing. Text based approaches to affective computing have been used on multiple corpora such as students evaluations, children stories and news stories.\n\n\n=== Scientific literature mining and academic applications ===\nThe issue of text mining is of importance to publishers who hold large databases of information needing indexing for retrieval. This is especially true in scientific disciplines, in which highly specific information is often contained within written text. Therefore, initiatives have been taken such as Nature's proposal for an Open Text Mining Interface (OTMI) and the National Institutes of Health's common Journal Publishing Document Type Definition (DTD) that would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.\nAcademic institutions have also become involved in the text mining initiative:\n\nThe National Centre for Text Mining (NaCTeM), is the first publicly funded text mining centre in the world. NaCTeM is operated by the University of Manchester in close collaboration with the Tsujii Lab, University of Tokyo. NaCTeM provides customised tools, research facilities and offers advice to the academic community. They are funded by the Joint Information Systems Committee (JISC) and two of the UK Research Councils (EPSRC & BBSRC). With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the areas of social sciences.\nIn the United States, the School of Information at University of California, Berkeley is developing a program called BioText to assist biology researchers in text mining and analysis.\nThe Text Analysis Portal for Research (TAPoR), currently housed at the University of Alberta, is a scholarly project to catalogue text analysis applications and create a gateway for researchers new to the practice.\n\n\n==== Methods for scientific literature mining ====\nComputational methods have been developed to assist with information retrieval from scientific literature. Published approaches include methods for searching, determining novelty, and clarifying homonyms among technical reports.\n\n\n=== Digital humanities and computational sociology ===\nThe automatic analysis of vast textual corpora has created the possibility for scholars to analyze\nmillions of documents in multiple languages with very limited manual intervention. Key enabling technologies have been parsing, machine translation, topic categorization, and machine learning.\n\nThe automatic parsing of textual corpora has enabled the extraction of actors and their relational networks on a vast scale, turning textual data into network data.  The resulting networks, which can contain thousands of nodes, are then analyzed by using tools from network theory to identify the key actors, the key communities or parties, and general properties such as robustness or structural stability of the overall network, or centrality of certain nodes. This automates the approach introduced by quantitative narrative analysis, whereby subject-verb-object triplets are identified with pairs of actors linked by an action, or pairs formed by actor-object.Content analysis has been a traditional part of social sciences and media studies for a long time. The automation of content analysis has allowed a \"big data\" revolution to take place in that field, with studies in social media and newspaper content that include millions of news items. Gender bias, readability, content similarity, reader preferences, and even mood have been analyzed based on text mining methods over millions of documents. The analysis of readability, gender bias and topic bias was demonstrated in Flaounas et al. showing how different topics have different gender biases and levels of readability; the possibility to detect mood patterns in a vast population by analyzing Twitter content was demonstrated as well.\n\n\n== Software ==\nText mining computer programs are available from many commercial and open source companies and sources. See List of text mining software.\n\n\n== Intellectual property law ==\n\n\n=== Situation in Europe ===\n\nUnder European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is illegal. In the UK in 2014, on the recommendation of the Hargreaves review, the government amended copyright law to allow text mining as a limitation and exception. It was  the second country in the world to do so, following Japan, which introduced a mining-specific exception in 2009. However, owing to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law does not allow this provision to be overridden by contractual terms and conditions.\nThe European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licenses for Europe. The fact that the focus on the solution to this legal issue was licenses, and not limitations and exceptions to copyright law, led representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.\n\n\n=== Situation in the United States ===\nUS copyright law, and in particular its fair use provisions, means that text mining in America, as well as other fair use countries such as Israel, Taiwan and South Korea, is viewed as being legal. As text mining is transformative, meaning that it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitization project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed\u2014one such use being text and data mining.\n\n\n== Implications ==\nUntil recently, websites most often used text-based searches, which only found documents containing specific user-defined words or phrases. Now, through use of a semantic web, text mining can find content based on meaning and context (rather than just by a specific word). Additionally, text mining software can be used to build large dossiers of information about specific people and events.  For example, large datasets based on data extracted from news reports can be built to facilitate social networks analysis or counter-intelligence.  In effect, the text mining software may act in a capacity similar to an intelligence analyst or research librarian, albeit with a more limited scope of analysis. Text mining is also used in some email spam filters as a way of determining the characteristics of messages that are likely to be advertisements or other unwanted material. Text mining plays an important role in determining financial market sentiment.\n\n\n== Future ==\nIncreasing interest is being paid to multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning.\nThe challenge of exploiting the large proportion of enterprise information that originates in \"unstructured\" form has been recognized for decades. It is recognized in the earliest definition of business intelligence (BI), in an October 1958 IBM Journal article by H.P. Luhn, A Business Intelligence System, which describes a system that will:\n\n\"...utilize data-processing machines for auto-abstracting and auto-encoding of documents and for creating interest profiles for each of the 'action points' in an organization. Both incoming and internally generated documents are automatically abstracted, characterized by a word pattern, and sent automatically to appropriate action points.\"\n\nYet as management information systems developed starting in the 1960s, and as BI emerged in the '80s and '90s as a software category and field of practice, the emphasis was on numerical data stored in relational databases. This is not surprising: text in \"unstructured\" documents is hard to process. The emergence of text analytics in its current form stems from a refocusing of research in the late 1990s from algorithm development to application, as described by Prof. Marti A. Hearst in the paper Untangling Text Data Mining:\nFor almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.\n\nHearst's 1999 statement of need fairly well describes the state of text analytics technology and practice a decade later.\n\n\n== See also ==\n\n\n== References ==\n\n\n=== Citations ===\n\n\n=== Sources ===\nAnaniadou, S. and McNaught, J. (Editors) (2006). Text Mining for Biology and Biomedicine. Artech House Books. ISBN 978-1-58053-984-5\nBilisoly, R. (2008). Practical Text Mining with Perl. New York: John Wiley & Sons. ISBN 978-0-470-17643-6\nFeldman, R., and Sanger, J. (2006). The Text Mining Handbook. New York: Cambridge University Press. ISBN 978-0-521-83657-9\nHotho, A., N\u00fcrnberger, A. and Paa\u00df, G. (2005). \"A brief survey of text mining\". In Ldv Forum, Vol. 20(1), p. 19-62\nIndurkhya, N., and Damerau, F. (2010). Handbook Of Natural Language Processing, 2nd Edition. Boca Raton, FL: CRC Press. ISBN 978-1-4200-8592-1\nKao, A., and Poteet, S. (Editors). Natural Language Processing and Text Mining. Springer. ISBN 1-84628-175-X\nKonchady, M. Text Mining Application Programming (Programming Series). Charles River Media. ISBN 1-58450-460-9\nManning, C., and Schutze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. ISBN 978-0-262-13360-9\nMiner, G., Elder, J., Hill. T, Nisbet, R., Delen, D. and Fast, A. (2012). Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications. Elsevier Academic Press. ISBN 978-0-12-386979-1\nMcKnight, W. (2005). \"Building business intelligence: Text data mining in business intelligence\". DM Review, 21-22.\nSrivastava, A., and Sahami. M. (2009). Text Mining: Classification, Clustering, and Applications. Boca Raton, FL: CRC Press. ISBN 978-1-4200-5940-3\nZanasi, A. (Editor) (2007). Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press. ISBN 978-1-84564-131-3\n\n\n== External links ==\nMarti Hearst: What Is Text Mining? (October, 2003)\nAutomatic Content Extraction, Linguistic Data Consortium\nAutomatic Content Extraction, NIST"}, {"job": "data scientist", "skill": "apache hadoop", "keywords": "apache hadoop", "description": "Apache Hadoop ( ) is a collection of open-source software utilities that facilitate using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Originally designed for computer clusters built from commodity hardware\u2014still the common use\u2014it has also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.The core of Apache Hadoop consists of a storage part, known as Hadoop Distributed File System (HDFS), and a processing part which is a MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code into nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have access to. This allows the dataset to be processed faster and more efficiently than it would be in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.The base Apache Hadoop framework is composed of the following modules:\n\nHadoop Common \u2013 contains libraries and utilities needed by other Hadoop modules;\nHadoop Distributed File System (HDFS) \u2013 a distributed file-system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster;\nHadoop YARN \u2013 (introduced in 2012) a platform responsible for managing computing resources in clusters and using them for scheduling users' applications;\nHadoop MapReduce \u2013 an implementation of the MapReduce programming model for large-scale data processing.The term Hadoop is often used for both base modules and sub-modules and also the ecosystem, or collection of additional software packages that can be installed on top of or alongside Hadoop, such as Apache Pig, Apache Hive, Apache HBase, Apache Phoenix, Apache Spark, Apache ZooKeeper, Cloudera Impala, Apache Flume, Apache Sqoop, Apache Oozie, and Apache Storm.Apache Hadoop's MapReduce and HDFS components were inspired by Google papers on MapReduce and Google File System.The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command line utilities written as shell scripts. Though MapReduce Java code is common, any programming language can be used with Hadoop Streaming to implement the map and reduce parts of the user's program. Other projects in the Hadoop ecosystem expose richer user interfaces.\n\n\n== History ==\nAccording to its co-founders, Doug Cutting and Mike Cafarella, the genesis of Hadoop was the Google File System paper that was published in October 2003. This paper spawned another one from Google \u2013  \"MapReduce: Simplified Data Processing on Large Clusters\". Development started on the Apache Nutch project, but was moved to the new Hadoop subproject in January 2006. Doug Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. The initial code that was factored out of Nutch consisted of about 5,000 lines of code for HDFS and about 6,000 lines of code for MapReduce.\nIn March 2006, Owen O\u2019Malley was the first committer to add to the Hadoop project; Hadoop 0.1.0 was released in April 2006. It continues to evolve through contributions that are being made to the project.\n\n\n== Architecture ==\n\nHadoop consists of the Hadoop Common package, which provides file system and operating system level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the Java ARchive (JAR) files and scripts needed to start Hadoop.\nFor effective scheduling of work, every Hadoop-compatible file system should provide location awareness, which is the name of the rack, specifically the network switch where a worker node is. Hadoop applications can use this information to execute code on the node where the data is, and, failing that, on the same rack/switch to reduce backbone traffic. HDFS uses this method when replicating data for data redundancy across multiple racks. This approach reduces the impact of a rack power outage or switch failure; if any of these hardware failures occurs, the data will remain available.\n\nA small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a Job Tracker, Task Tracker, NameNode, and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only and compute-only worker nodes. These are normally used only in nonstandard applications.Hadoop requires Java Runtime Environment (JRE) 1.6 or higher. The standard startup and shutdown scripts require that Secure Shell (SSH) be set up between nodes in the cluster.In a larger cluster, HDFS nodes are managed through a dedicated NameNode server to host the file system index, and a secondary NameNode that can generate snapshots of the namenode's memory structures, thereby preventing file-system corruption and loss of data. Similarly, a standalone JobTracker server can manage job scheduling across nodes. When Hadoop MapReduce is used with an alternate file system, the NameNode, secondary NameNode, and DataNode architecture of HDFS are replaced by the file-system-specific equivalents.\n\n\n=== File systems ===\n\n\n==== Hadoop distributed file system ====\nThe Hadoop distributed file system (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework. Some consider it to instead be a data store due to its lack of POSIX compliance, but it does provide shell commands and Java application programming interface (API) methods that are similar to other file systems. A Hadoop is divided into HDFS and MapReduce. HDFS is used for storing the data and MapReduce is used for the Processing the Data.\nHDFS has five services as follows:\n1. Name Node\n2. Secondary Name Node\n3. Job tracker\n4. Data Node\n5. Task Tracker\nTop three are Master Services/Daemons/Nodes and bottom two are Slave Services. Master Services can communicate with each other and in the same way Slave services can communicate with each other. Name Node is a master node and Data node is its corresponding Slave node and can talk with each other.\nName Node: HDFS consists of only one Name Node we call it as Master Node which can track the files, manage the file system and has the meta data and the whole data in it. To be particular Name node contains the details of the No. of blocks, Locations at what data node the data is stored and where the replications are stored and other details. As we have only one Name Node we call it as Single Point Failure. It has Direct connect with the client.\nData Node: A Data Node stores data in it as the blocks. This is also known as the slave node and it stores the actual data into HDFS which is responsible for the client to read and write. These are slave daemons. Every Data node sends a Heartbeat message to the Name node every 3 seconds and conveys that it is alive. In this way when Name Node does not receive a heartbeat from a data node for 2 minutes, it will take that data node as dead and starts the process of block replications on some other Data node.\nSecondary Name Node: This is only to take care of the checkpoints of the file system metadata which is in the Name Node. This is also known as the checkpoint Node. It is helper Node for the Name Node.\nJob Tracker: Basically Job Tracker will be useful in the Processing the data. Job Tracker receives the requests for Map Reduce execution from the client. Job tracker talks to the Name node to know about the location of the data like Job Tracker will request the Name Node for the processing the data. Name node in response gives the Meta data to job tracker.\nTask Tracker: It is the Slave Node for the Job Tracker and it will take the task from the Job Tracker. And also it receives code from the Job Tracker. Task Tracker will take the code and apply on the file. The process of applying that code on the file is known as Mapper.Hadoop cluster has nominally a single namenode plus a cluster of datanodes, although redundancy options are available for the namenode due to its criticality. Each datanode serves up blocks of data over the network using a block protocol specific to HDFS. The file system uses TCP/IP sockets for communication. Clients use remote procedure calls (RPC) to communicate with each other.\nHDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines. It achieves reliability by replicating the data across multiple hosts, and hence theoretically does not require redundant array of independent disks (RAID) storage on hosts (but to increase input-output (I/O) performance some RAID configurations are still useful). With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack. Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high. HDFS is not fully POSIX-compliant, because the requirements for a POSIX file-system differ from the target goals of a Hadoop application. The trade-off of not having a fully POSIX-compliant file-system is increased performance for data throughput and support for non-POSIX operations such as Append.In May 2012, high-availability capabilities were added to HDFS, letting the main metadata server called the NameNode manually fail-over onto a backup. The project has also started developing automatic fail-overs.\nThe HDFS file system includes a so-called secondary namenode, a misleading term that some might incorrectly interpret as a backup namenode when the primary namenode goes offline. In fact, the secondary namenode regularly connects with the primary namenode and builds snapshots of the primary namenode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary namenode without having to replay the entire journal of file-system actions, then to edit the log to create an up-to-date directory structure. Because the namenode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a new addition, aims to tackle this problem to a certain extent by allowing multiple namespaces served by separate namenodes. Moreover, there are some issues in HDFS such as small file issues, scalability problems, Single Point of Failure (SPoF), and bottlenecks in huge metadata requests.\nOne advantage of using HDFS is data awareness between the job tracker and task tracker. The job tracker schedules map or reduce jobs to task trackers with an awareness of the data location. For example: if node A contains data (a, b, c) and node X contains data (x, y, z), the job tracker schedules node A to perform map or reduce tasks on (a, b, c) and node X would be scheduled to perform map or reduce tasks on (x, y, z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer. When Hadoop is used with other file systems, this advantage is not always available. This can have a significant impact on job-completion times as demonstrated with data-intensive jobs.HDFS was designed for mostly immutable files and may not be suitable for systems requiring concurrent write operations.HDFS can be mounted directly with a Filesystem in Userspace (FUSE) virtual file system on Linux and some other Unix systems.\nFile access can be achieved through the native Java API, the Thrift API (generates a client in a number of languages e.g. C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, Smalltalk, and OCaml), the command-line interface, the HDFS-UI web application over HTTP, or via 3rd-party network client libraries.HDFS is designed for portability across various hardware platforms and for compatibility with a variety of underlying operating systems. The HDFS design introduces portability limitations that result in some performance bottlenecks, since the Java implementation cannot use features that are exclusive to the platform on which HDFS is running. Due to its widespread integration into enterprise-level infrastructure, monitoring HDFS performance at scale has become an increasingly important issue. Monitoring end-to-end performance requires tracking metrics from datanodes, namenodes, and the underlying operating system. There are currently several monitoring platforms to track HDFS performance, including Hortonworks, Cloudera, and Datadog.\n\n\n==== Other file systems ====\nHadoop works directly with any distributed file system that can be mounted by the underlying operating system by simply using a file:// URL; however, this comes at a price \u2013 the loss of locality. To reduce network traffic, Hadoop needs to know which servers are closest to the data, information that Hadoop-specific file system bridges can provide.\nIn May 2011, the list of supported file systems bundled with Apache Hadoop were:\n\nHDFS: Hadoop's own rack-aware file system. This is designed to scale to tens of petabytes of storage and runs on top of the file systems of the underlying operating systems.\nFTP file system: This stores all its data on remotely accessible FTP servers.\nAmazon S3 (Simple Storage Service) object storage: This is targeted at clusters hosted on the Amazon Elastic Compute Cloud server-on-demand infrastructure. There is no rack-awareness in this file system, as it is all remote.\nWindows Azure Storage Blobs (WASB) file system: This is an extension of HDFS that allows distributions of Hadoop to access data in Azure blob stores without moving the data permanently into the cluster.A number of third-party file system bridges have also been written, none of which are currently in Hadoop distributions. However, some commercial distributions of Hadoop ship with an alternative file system as the default \u2013 specifically IBM and MapR.\n\nIn 2009, IBM discussed running Hadoop over the IBM General Parallel File System. The source code was published in October 2009.\nIn April 2010, Parascale published the source code to run Hadoop against the Parascale file system.\nIn April 2010, Appistry released a Hadoop file system driver for use with its own CloudIQ Storage product.\nIn June 2010, HP discussed a location-aware IBRIX Fusion file system driver.\nIn May 2011, MapR Technologies Inc. announced the availability of an alternative file system for Hadoop, MapR FS, which replaced the HDFS file system with a full random-access read/write file system.\n\n\n=== JobTracker and TaskTracker: the MapReduce engine ===\n\nAtop the file systems comes the MapReduce Engine, which consists of one JobTracker, to which client applications submit MapReduce jobs. The JobTracker pushes work to available TaskTracker nodes in the cluster, striving to keep the work as close to the data as possible. With a rack-aware file system, the JobTracker knows which node contains the data, and which other machines are nearby. If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack. This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled. The TaskTracker on each node spawns a separate Java virtual machine (JVM) process to prevent the TaskTracker itself from failing if the running job crashes its JVM. A heartbeat is sent from the TaskTracker to the JobTracker every few minutes to check its status. The Job Tracker and TaskTracker status and information is exposed by Jetty and can be viewed from a web browser.\nKnown limitations of this approach are:\n\nThe allocation of work to TaskTrackers is very simple. Every TaskTracker has a number of available slots (such as \"4 slots\"). Every active map or reduce task takes up one slot. The Job Tracker allocates work to the tracker nearest to the data with an available slot. There is no consideration of the current system load of the allocated machine, and hence its actual availability.\nIf one TaskTracker is very slow, it can delay the entire MapReduce job \u2013  especially towards the end, when everything can end up waiting for the slowest task. With speculative execution enabled, however, a single task can be executed on multiple slave nodes.\n\n\n==== Scheduling ====\nBy default Hadoop uses FIFO scheduling, and optionally 5 scheduling priorities to schedule jobs from a work queue. In version 0.19 the job scheduler was refactored out of the JobTracker, while adding the ability to use an alternate scheduler (such as the Fair scheduler or the Capacity scheduler, described next).\n\n\n===== Fair scheduler =====\nThe fair scheduler was developed by Facebook. The goal of the fair scheduler is to provide fast response times for small jobs and Quality of service (QoS) for production jobs. The fair scheduler has three basic concepts.\nJobs are grouped into pools.\nEach pool is assigned a guaranteed minimum share.\nExcess capacity is split between jobs.By default, jobs that are uncategorized go into a default pool. Pools have to specify the minimum number of map slots, reduce slots, as well as a limit on the number of running jobs.\n\n\n===== Capacity scheduler =====\nThe capacity scheduler was developed by Yahoo. The capacity scheduler supports several features that are similar to those of the fair scheduler.\nQueues are allocated a fraction of the total resource capacity.\nFree resources are allocated to queues beyond their total capacity.\nWithin a queue, a job with a high level of priority has access to the queue's resources.There is no preemption once a job is running.\n\n\n=== Difference between Hadoop 1 and Hadoop 2 (YARN) ===\nThe biggest difference between Hadoop 1 and Hadoop 2 is the addition of YARN (Yet Another Resource Negotiator), which replaced the MapReduce engine in the first version of Hadoop.\nYARN strives to allocate resources to various applications effectively. It runs two d\u00e6mons, which take care of two different tasks: the resource manager, which does job tracking and resource allocation to applications, the application master, which monitors progress of the execution.\n\n\n=== Difference between Hadoop 2 and Hadoop 3 ===\nThere are important features provided by Hadoop 3. For example, while there is one single namenode in Hadoop 2, Hadoop 3 enables having multiple name nodes, which solves the single point of failure problem.\nIn Hadoop 3, there are containers working in principle of Docker, which reduces time spent on application development.\nOne of the biggest changes is that Hadoop 3 decreases storage overhead with erasure coding.\nAlso, Hadoop 3 permits usage of GPU hardware within the cluster, which is a very substantial benefit to execute deep learning algorithms on a Hadoop cluster.\n\n\n=== Other applications ===\nThe HDFS file system is not restricted to MapReduce jobs. It can be used for other applications, many of which are under development at Apache. The list includes the HBase database, the Apache Mahout machine learning system, and the Apache Hive Data Warehouse system. Hadoop can, in theory, be used for any sort of work that is batch-oriented rather than real-time, is very data-intensive, and benefits from parallel processing of data. It can also be used to complement a real-time system, such as lambda architecture, Apache Storm, Flink and Spark Streaming.As of October 2009, commercial applications of Hadoop included:-\n\nlog and/or clickstream analysis of various kinds\nmarketing analytics\nmachine learning and/or sophisticated data mining\nimage processing\nprocessing of XML messages\nweb crawling and/or text processing\ngeneral archiving, including of relational/tabular data, e.g. for compliance\n\n\n== Prominent use cases ==\nOn 19 February 2008, Yahoo! Inc. launched what they claimed was the world's largest Hadoop production application. The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster with more than 10,000 cores and produced data that was used in every Yahoo! web search query. There are multiple Hadoop clusters at Yahoo! and no HDFS file systems or MapReduce jobs are split across multiple data centers. Every Hadoop cluster node bootstraps the Linux image, including the Hadoop distribution. Work that the clusters perform is known to include the index calculations for the Yahoo! search engine. In June 2009, Yahoo! made the source code of its Hadoop version available to the open-source community.In 2010, Facebook claimed that they had the largest Hadoop cluster in the world with 21 PB of storage. In June 2012, they announced the data had grown to 100 PB and later that year they announced that the data was growing by roughly half a PB per day.As of 2013, Hadoop adoption had become widespread: more than half of the Fortune 50 companies used Hadoop.\n\n\n== Hadoop hosting in the cloud ==\nHadoop can be deployed in a traditional onsite datacenter as well as in the cloud. The cloud allows organizations to deploy Hadoop without the need to acquire hardware or specific setup expertise.\n\n\n== Commercial support ==\nA number of companies offer commercial implementations or support for Hadoop.\n\n\n=== Branding ===\nThe Apache Software Foundation has stated that only software officially released by the Apache Hadoop Project can be called Apache Hadoop or Distributions of Apache Hadoop. The naming of products and derivative works from other vendors and the term \"compatible\" are somewhat controversial within the Hadoop developer community.\n\n\n== Papers ==\nSome papers influenced the birth and growth of Hadoop and big data processing. Some of these are:\n\nJeffrey Dean, Sanjay Ghemawat (2004) MapReduce: Simplified Data Processing on Large Clusters, Google. This paper inspired Doug Cutting to develop an open-source implementation of the Map-Reduce framework. He named it Hadoop, after his son's toy elephant.\nMichael Franklin, Alon Halevy, David Maier (2005) From Databases to Dataspaces: A New Abstraction for Information Management. The authors highlight the need for storage systems to accept all data formats and to provide APIs for data access that evolve based on the storage system's understanding of the data.\nFay Chang et al. (2006) Bigtable: A Distributed Storage System for Structured Data, Google.\nRobert Kallman et al. (2008) H-store: a high-performance, distributed main memory transaction processing system\n\n\n== See also ==\n\nApache Accumulo \u2013 Secure Bigtable\nApache Cassandra, a column-oriented database that supports access from Hadoop\nApache CouchDB, a database that uses JSON for documents, JavaScript for MapReduce queries, and regular HTTP for an API\nApache HCatalog, a table and storage management layer for Hadoop\nBig data\nData Intensive Computing\nHPCC \u2013 LexisNexis Risk Solutions High Performance Computing Cluster\nHypertable \u2013 HBase alternative\nSector/Sphere \u2013 Open source distributed storage and processing\nSimple Linux Utility for Resource Management\n\n\n== References ==\n\n\n== Bibliography ==\n\n\n== External links ==\nOfficial website"}, {"job": "data scientist", "skill": "apache hive", "keywords": "apache hive", "description": "Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.\n\n\n== Features ==\nApache Hive supports analysis of large datasets stored in Hadoop's HDFS and compatible file systems such as Amazon S3 filesystem and Alluxio. It provides a SQL-like query language called HiveQL with schema on read and transparently converts queries to MapReduce, Apache Tez and Spark jobs. All three execution engines can run in Hadoop's resource negotiator, YARN (Yet Another Resource Negotiator). To accelerate queries, it provides indexes, including bitmap indexes.\nOther features of Hive include:\n\nIndexing to provide acceleration, index type including compaction and bitmap index as of 0.10, more index types are planned.\nDifferent storage types such as plain text, RCFile, HBase, ORC, and others.\nMetadata storage in a relational database management system, significantly reducing the time to perform semantic checks during query execution.\nOperating on compressed data stored into the Hadoop ecosystem using algorithms including DEFLATE, BWT, snappy, etc.\nBuilt-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by built-in functions.\nSQL-like queries (HiveQL), which are implicitly converted into MapReduce or Tez, or Spark jobs.By default, Hive stores metadata in an embedded Apache Derby database, and other client/server databases like MySQL can optionally be used.The first four file formats supported in Hive were plain text, sequence file, optimized row columnar (ORC) format and RCFile. Apache Parquet can be read via plugin in versions later than 0.10 and natively starting at 0.13. Additional Hive plugins support querying of the Bitcoin Blockchain.\n\n\n== Architecture ==\nMajor components of the Hive architecture are:\n\nMetastore: Stores metadata for each of the tables such as their schema and location.  It also includes the partition metadata which helps the driver to track the progress of various data sets distributed over the cluster. The data is stored in a traditional RDBMS format. The metadata helps the driver to keep track of the data and it is crucial. Hence, a backup server regularly replicates the data which can be retrieved in case of data loss.\nDriver: Acts like a controller which receives the HiveQL statements. It starts the execution of the statement by creating sessions, and monitors the life cycle and progress of the execution. It stores the necessary metadata generated during the execution of a HiveQL statement. The driver also acts as a collection point of data or query results obtained after the Reduce operation.\nCompiler: Performs compilation of the HiveQL query, which converts the query to an execution plan.  This plan contains the tasks and steps needed to be performed by the Hadoop MapReduce to get the output as translated by the query. The compiler converts the query to an abstract syntax tree (AST). After checking for compatibility and compile time errors, it converts the AST to a directed acyclic graph (DAG). The DAG divides operators to MapReduce stages and tasks based on the input query and data.\nOptimizer: Performs various transformations on the execution plan to get an optimized DAG.  Transformations can be aggregated together, such as converting a pipeline of joins to a single join, for better performance. It can also split the tasks, such as applying a transformation on data before a reduce operation, to provide better performance and scalability. However, the logic of transformation used for optimization used can be modified or pipelined using another optimizer.\nExecutor: After compilation and optimization, the executor executes the tasks. It interacts with the job tracker of Hadoop to schedule tasks to be run. It takes care of pipelining the tasks by making sure that a task with dependency gets executed only if all other prerequisites are run.\nCLI, UI, and Thrift Server: A  command-line interface (CLI) provides a user interface for an external user to interact with Hive by submitting queries, instructions and monitoring the process status. Thrift server allows external clients to interact with Hive over a network, similar to the JDBC or ODBC protocols.\n\n\n== HiveQL ==\nWhile based on SQL, HiveQL does not strictly follow the full SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but only offers basic support for indexes. \nHiveQL lacked support for transactions and materialized views, and only limited subquery support. Support for insert, update, and delete with full ACID functionality was made available with release 0.14.Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce, Tez, or Spark jobs, which are submitted to Hadoop for execution.\n\n\n=== Example ===\n\n\n==== \"Word count\" program in Pig ====\n\n\n==== \"Word count\" program in HiveQL ====\nThe word count program counts the number of times each word occurs in the input. The word count can be written in HiveQL as:\n\nA brief explanation of each of the statements is as follows:\n\nChecks if table docs exists and drops it if it does. Creates a new table called docs with a single column of type STRING called line.\n\nLoads the specified file or directory (In this case \u201cinput_file\u201d) into the table. OVERWRITE specifies that the target table to which the data is being loaded into is to be re-written; Otherwise the data would be appended.\n\nThe query CREATE TABLE word_counts AS SELECT word, count(1) AS count creates a table called word_counts with two columns: word and  count. This query draws its input from the inner query (SELECT explode(split(line, '\\s')) AS word FROM docs) temp\". This query serves to split the input words into different rows of a temporary table aliased as temp. The GROUP BY WORD groups the results based on their keys. This results in the count column holding the number of occurrences for each word of the word column. The ORDER BY WORDS sorts the words alphabetically.\n\n\n== Comparison with traditional databases ==\nThe storage and querying operations of Hive closely resemble those of traditional databases. While Hive is a SQL dialect, there are a lot of differences in structure and working of Hive in comparison to relational databases. The differences are mainly because Hive is built on top of the Hadoop ecosystem, and has to comply with the restrictions of Hadoop and MapReduce.\nA schema is applied to a table in traditional databases. In such traditional databases, the table typically enforces the schema when the data is loaded into the table. This enables the database to make sure that the data entered follows the representation of the table as specified by the table definition. This design is called schema on write. In comparison, Hive does not verify the data against the table schema on write. Instead, it subsequently does run time checks when the data is read. This model is called schema on read. The two approaches have their own advantages and drawbacks. Checking data against table schema during the load time adds extra overhead, which is why traditional databases take a longer time to load data. Quality checks are performed against the data at the load time to ensure that the data is not corrupt. Early detection of corrupt data ensures early exception handling. Since the tables are forced to match the schema after/during the data load, it has better query time performance. Hive, on the other hand, can load data dynamically without any schema check, ensuring a fast initial load, but with the drawback of comparatively slower performance at query time. Hive does have an advantage when the schema is not available at the load time, but is instead generated later dynamically.Transactions are key operations in traditional databases. As any typical RDBMS, Hive supports all four properties of transactions (ACID): Atomicity, Consistency, Isolation, and Durability. Transactions in Hive were introduced in Hive 0.13 but were only limited to the partition level.  Recent version of Hive 0.14 were these functions fully added to support complete ACID properties. Hive 0.14 and later provides different row level transactions such as INSERT, DELETE and UPDATE. Enabling INSERT, UPDATE, DELETE transactions require setting appropriate values for configuration properties such as hive.support.concurrency, hive.enforce.bucketing, and hive.exec.dynamic.partition.mode.\n\n\n== Security ==\nHive v0.7.0 added integration with Hadoop security. Hadoop began using Kerberos authorization support to provide security. Kerberos allows for mutual authentication between client and server. In this system, the client's request for a ticket is passed along with the request. The previous versions of Hadoop had several issues such as users being able to spoof their username by setting the hadoop.job.ugi property and also MapReduce operations being run under the same user: hadoop or mapred. With Hive v0.7.0's integration with Hadoop security, these issues have largely been fixed. TaskTracker jobs are run by the user who launched it and the username can no longer be spoofed by setting the hadoop.job.ugi property. Permissions for newly created files in Hive are dictated by the HDFS. The Hadoop distributed file system authorization model uses three entities: user, group and others with three permissions: read, write and execute. The default permissions for newly created files can be set by changing the umask value for the Hive configuration variable hive.files.umask.value.\n\n\n== See also ==\nApache Pig\nSqoop\nApache Impala\nApache Drill\nApache Flume\nApache HBase\n\n\n== References ==\n\n\n== External links ==\nOfficial website"}, {"job": "data scientist", "skill": "data mining", "keywords": "data mining", "description": "Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the \"knowledge discovery in databases\" process or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.The term \"data mining\" is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics \u2013 or, when referring to actual methods, artificial intelligence and machine learning \u2013 are more appropriate.\nThe actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.\nThe difference between data analysis and data mining is that data analysis is used to test models and hypotheses on the dataset, e.g., analyzing the effectiveness of a marketing campaign, regardless of the amount of data; in contrast, data mining uses machine learning and statistical models to uncover clandestine or hidden patterns in a large volume of data.The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.\n\n\n== Etymology ==\nIn the 1960s, statisticians and economists used terms like data fishing or data dredging to refer to what they considered the bad practice of analyzing data without an a-priori hypothesis. The term \"data mining\" was used in a similarly critical way by economist Michael Lovell in an article published in the Review of Economic Studies in 1983. Lovell indicates that the practice \"masquerades under a variety of aliases, ranging from \"experimentation\" (positive) to \"fishing\" or \"snooping\" (negative).\nThe term data mining appeared around 1990 in the database community, generally with positive connotations. For a short time in 1980s, a phrase \"database mining\"\u2122, was used, but since it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation; researchers consequently turned to data mining. Other terms used include data archaeology, information harvesting, information discovery, knowledge extraction, etc. Gregory Piatetsky-Shapiro coined the term \"knowledge discovery in databases\" for the first workshop on the same topic (KDD-1989) and this term became more popular in AI and machine learning community. However, the term data mining became more popular in the business and press communities. Currently, the terms data mining and knowledge discovery are used interchangeably.\nIn the academic community, the major forums for research started in 1995 when the First International Conference on Data Mining and Knowledge Discovery (KDD-95) was started in Montreal under AAAI sponsorship. It was co-chaired by Usama Fayyad and Ramasamy Uthurusamy. A year later, in 1996, Usama Fayyad launched the journal by Kluwer called Data Mining and Knowledge Discovery as its founding editor-in-chief. Later he started the SIGKDD Newsletter SIGKDD Explorations. The KDD International conference became the primary highest quality conference in data mining with an acceptance rate of research paper submissions below 18%. The journal Data Mining and Knowledge Discovery is the primary research journal of the field.\n\n\n== Background ==\nThe manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have dramatically increased data collection, storage, and manipulation ability. As data sets have grown in size and complexity, direct \"hands-on\" data analysis has increasingly been augmented with indirect, automated data processing, aided by other discoveries in computer science, such as neural networks, cluster analysis, genetic algorithms (1950s), decision trees and decision rules (1960s), and support vector machines (1990s). Data mining is the process of applying these methods with the intention of uncovering hidden patterns in large data sets. It bridges the gap from applied statistics and artificial intelligence (which usually provide the mathematical background) to database management by exploiting the way data is stored and indexed in databases to execute the actual learning and discovery algorithms more efficiently, allowing such methods to be applied to ever-larger data sets.\n\n\n== Process ==\nThe knowledge discovery in databases (KDD) process is commonly defined with the stages:\n\nSelection\nPre-processing\nTransformation\nData mining\nInterpretation/evaluation.It exists, however, in many variations on this theme, such as the Cross-industry standard process for data mining (CRISP-DM) which defines six phases:\n\nBusiness understanding\nData understanding\nData preparation\nModeling\nEvaluation\nDeploymentor a simplified process such as (1) Pre-processing, (2) Data Mining, and (3) Results Validation.\nPolls conducted in 2002, 2004, 2007 and 2014 show that the CRISP-DM methodology is the leading methodology used by data miners. The only other data mining standard named in these polls was SEMMA. However, 3\u20134 times as many people reported using CRISP-DM. Several teams of researchers have published reviews of data mining process models, and Azevedo and Santos conducted a comparison of CRISP-DM and SEMMA in 2008.\n\n\n=== Pre-processing ===\nBefore data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined within an acceptable time limit. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyze the multivariate data sets before data mining. The target set is then cleaned. Data cleaning removes the observations containing noise and those with missing data.\n\n\n=== Data mining ===\nData mining involves six common classes of tasks:\nAnomaly detection (outlier/change/deviation detection) \u2013 The identification of unusual data records, that might be interesting or data errors that require further investigation.\nAssociation rule learning (dependency modeling) \u2013 Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.\nClustering \u2013 is the task of discovering groups and structures in the data that are in some way or another \"similar\", without using known structures in the data.\nClassification \u2013 is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as \"legitimate\" or as \"spam\".\nRegression \u2013 attempts to find a function that models the data with the least error that is, for estimating the relationships among data or datasets.\nSummarization \u2013 providing a more compact representation of the data set, including visualization and report generation.\n\n\n=== Results validation ===\n\nData mining can unintentionally be misused, and can then produce results that appear to be significant; but which do not actually predict future behavior and cannot be reproduced on a new sample of data and bear little use. Often this results from investigating too many hypotheses and not performing proper statistical hypothesis testing. A simple version of this problem in machine learning is known as overfitting, but the same problem can arise at different phases of the process and thus a train/test split\u2014when applicable at all\u2014may not be sufficient to prevent this from happening.\nThe final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by data mining algorithms are necessarily valid. It is common for data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set, and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish \"spam\" from \"legitimate\" emails would be trained on a training set of sample e-mails. Once trained, the learned patterns would be applied to the test set of e-mails on which it had not been trained. The accuracy of the patterns can then be measured from how many e-mails they correctly classify. Several statistical methods may be used to evaluate the algorithm, such as ROC curves.\nIf the learned patterns do not meet the desired standards, subsequently it is necessary to re-evaluate and change the pre-processing and data mining steps. If the learned patterns do meet the desired standards, then the final step is to interpret the learned patterns and turn them into knowledge.\n\n\n== Research ==\nThe premier professional body in the field is the Association for Computing Machinery's (ACM) Special Interest Group (SIG) on Knowledge Discovery and Data Mining (SIGKDD). Since 1989, this ACM SIG has hosted an annual international conference and published its proceedings, and since 1999 it has published a biannual academic journal titled \"SIGKDD Explorations\".Computer science conferences on data mining include:\n\nCIKM Conference \u2013 ACM Conference on Information and Knowledge Management\nEuropean Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases\nKDD Conference \u2013 ACM SIGKDD Conference on Knowledge Discovery and Data MiningData mining topics are also present on many data management/database conferences such as the ICDE Conference, SIGMOD Conference and International Conference on Very Large Data Bases\n\n\n== Standards ==\nThere have been some efforts to define standards for the data mining process, for example, the 1999 European Cross Industry Standard Process for Data Mining (CRISP-DM 1.0) and the 2004 Java Data Mining standard (JDM 1.0). Development on successors to these processes (CRISP-DM 2.0 and JDM 2.0) was active in 2006 but has stalled since. JDM 2.0 was withdrawn without reaching a final draft.\nFor exchanging the extracted models \u2013 in particular for use in predictive analytics \u2013 the key standard is the Predictive Model Markup Language (PMML), which is an XML-based language developed by the Data Mining Group (DMG) and supported as exchange format by many data mining applications. As the name suggests, it only covers prediction models, a particular data mining task of high importance to business applications. However, extensions to cover (for example) subspace clustering have been proposed independently of the DMG.\n\n\n== Notable uses ==\n\nData mining is used wherever there is digital data available today. Notable examples of data mining can be found throughout business, medicine, science, and surveillance.\n\n\n== Privacy concerns and ethics ==\nWhile the term \"data mining\" itself may have no ethical implications, it is often associated with the mining of information in relation to peoples' behavior (ethical and otherwise).The ways in which data mining can be used can in some cases and contexts raise questions regarding privacy, legality, and ethics. In particular, data mining government or commercial data sets for national security or law enforcement purposes, such as in the Total Information Awareness Program or in ADVISE, has raised privacy concerns.Data mining requires data preparation which uncovers information or patterns which compromise confidentiality and privacy obligations. A common way for this to occur is through data aggregation. Data aggregation involves combining data together (possibly from various sources) in a way that facilitates analysis (but that also might make identification of private, individual-level data deducible or otherwise apparent). This is not data mining per se, but a result of the preparation of data before \u2013 and for the purposes of \u2013 the analysis. The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when the data were originally anonymous.It is recommended to be aware of the following before data are collected:\nthe purpose of the data collection and any (known) data mining projects;\nhow the data will be used;\nwho will be able to mine the data and use the data and their derivatives;\nthe status of security surrounding access to the data;\nhow collected data can be updated.Data may also be modified so as to become anonymous, so that individuals may not readily be identified. However, even \"\"anonymized\" data sets can potentially contain enough information to allow identification of individuals, as occurred when journalists were able to find several individuals based on a set of search histories that were inadvertently released by AOL.The inadvertent revelation of personally identifiable information leading to the provider violates Fair Information Practices.   This indiscretion can cause financial,\nemotional, or bodily harm to the indicated individual.  In one instance of privacy violation, the patrons of Walgreens filed a lawsuit against the company in 2011 for selling\nprescription information to data mining companies who in turn provided the data\nto pharmaceutical companies.\n\n\n=== Situation in Europe ===\nEurope has rather strong privacy laws, and efforts are underway to further strengthen the rights of the consumers. However, the U.S.-E.U. Safe Harbor Principles currently effectively expose European users to privacy exploitation by U.S. companies. As a consequence of Edward Snowden's global surveillance disclosure, there has been increased discussion to revoke this agreement, as in particular the data will be fully exposed to the National Security Agency, and attempts to reach an agreement have failed.\n\n\n=== Situation in the United States ===\nIn the United States, privacy concerns have been addressed by the US Congress via the passage of regulatory controls such as the Health Insurance Portability and Accountability Act (HIPAA). The HIPAA requires individuals to give their \"informed consent\" regarding information they provide and its intended present and future uses. According to an article in Biotech Business Week, \"'[i]n practice, HIPAA may not offer any greater protection than the longstanding regulations in the research arena,' says the AAHC. More importantly, the rule's goal of protection through informed consent is approach a level of incomprehensibility to average individuals.\" This underscores the necessity for data anonymity in data aggregation and mining practices.\nU.S. information privacy legislation such as HIPAA and the Family Educational Rights and Privacy Act (FERPA) applies only to the specific areas that each such law addresses. The use of data mining by the majority of businesses in the U.S. is not controlled by any legislation.\n\n\n== Copyright law ==\n\n\n=== Situation in Europe ===\nUnder European copyright and database laws, the mining of in-copyright works (such as by web mining) without the permission of the copyright owner is not legal. Where a database is pure data in Europe, it may be that there is no copyright \u2013  but database rights may exist so data mining becomes subject to intellectual property owners' rights that are protected by the Database Directive. On the recommendation of the Hargreaves review, this led to the UK government to amend its copyright law in 2014 to allow content mining as a limitation and exception. The UK was the second country in the world to do so after Japan, which introduced an exception in 2009 for data mining. However, due to the restriction of the Information Society Directive (2001), the UK exception only allows content mining for non-commercial purposes. UK copyright law also does not allow this provision to be overridden by contractual terms and conditions. \nThe European Commission facilitated stakeholder discussion on text and data mining in 2013, under the title of Licences for Europe. The focus on the solution to this legal issue, such as licensing rather than limitations and exceptions, led to representatives of universities, researchers, libraries, civil society groups and open access publishers to leave the stakeholder dialogue in May 2013.\n\n\n=== Situation in the United States ===\nUS copyright law, and in particular its provision for fair use, means that content mining in America, as well as other fair use countries such as Israel, Taiwan and South Korea is viewed as being legal. As content mining is transformative, that is it does not supplant the original work, it is viewed as being lawful under fair use. For example, as part of the Google Book settlement the presiding judge on the case ruled that Google's digitisation project of in-copyright books was lawful, in part because of the transformative uses that the digitization project displayed - one being text and data mining.\n\n\n== Software ==\n\n\n=== Free open-source data mining software and applications ===\nThe following applications are available under free/open-source licenses. Public access to application source code is also available.\n\nCarrot2: Text and search results clustering framework.\nChemicalize.org: A chemical structure miner and web search engine.\nELKI: A university research project with advanced cluster analysis and outlier detection methods written in the Java language.\nGATE: a natural language processing and language engineering tool.\nKNIME: The Konstanz Information Miner, a user-friendly and comprehensive data analytics framework.\nMassive Online Analysis (MOA): a real-time big data stream mining with concept drift tool in the Java programming language.\nMEPX - cross-platform tool for regression and classification problems based on a Genetic Programming variant.\nML-Flex: A software package that enables users to integrate with third-party machine-learning packages written in any programming language, execute classification analyses in parallel across multiple computing nodes, and produce HTML reports of classification results.\nmlpack: a collection of ready-to-use machine learning algorithms written in the C++ language.\nNLTK (Natural Language Toolkit): A suite of libraries and programs for symbolic and statistical natural language processing (NLP) for the Python language.\nOpenNN: Open neural networks library.\nOrange: A component-based data mining and machine learning software suite written in the Python language.\nR: A programming language and software environment for statistical computing, data mining, and graphics. It is part of the GNU Project.\nscikit-learn is an open-source machine learning library for the Python programming language\nTorch: An open-source deep learning library for the Lua programming language and scientific computing framework with wide support for machine learning algorithms.\nUIMA: The UIMA (Unstructured Information Management Architecture) is a component framework for analyzing unstructured content such as text, audio and video \u2013 originally developed by IBM.\nWeka: A suite of machine learning software applications written in the Java programming language.\n\n\n=== Proprietary data-mining software and applications ===\nThe following applications are available under proprietary licenses.\n\nAngoss KnowledgeSTUDIO: data mining tool\nClarabridge:  text analytics product.\nLIONsolver: an integrated software application for data mining, business intelligence, and modeling that implements the Learning and Intelligent OptimizatioN (LION) approach.\nMegaputer Intelligence: data and text mining software is called PolyAnalyst.\nMicrosoft Analysis Services: data mining software provided by Microsoft.\nNetOwl: suite of multilingual text and entity analytics products that enable data mining.\nOracle Data Mining: data mining software by Oracle Corporation.\nPSeven: platform for automation of engineering simulation and analysis, multidisciplinary optimization and data mining provided by DATADVANCE.\nQlucore Omics Explorer: data mining software.\nRapidMiner: An environment for machine learning and data mining experiments.\nSAS Enterprise Miner: data mining software provided by the SAS Institute.\nSPSS Modeler: data mining software provided by IBM.\nSTATISTICA Data Miner: data mining software provided by StatSoft.\nTanagra: Visualisation-oriented data mining software, also for teaching.\nVertica: data mining software provided by Hewlett-Packard.\n\n\n== See also ==\nMethods\nApplication domains\nApplication examples\n\nRelated topicsData mining is about analyzing data; for information about extracting information out of data, see:\n\nOther resourcesInternational Journal of Data Warehousing and Mining\n\n\n== References ==\n\n\n== Further reading ==\n\n\n== External links ==\nKnowledge Discovery Software at Curlie\nData Mining Tool Vendors at Curlie"}, {"job": "data scientist", "skill": "tensorflow", "keywords": "tensorflow", "description": "TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks. It is used for both research and production at Google.\u200d\u2009\u2009TensorFlow was developed by the Google Brain team for internal Google use. It was released under the Apache License 2.0 on November 9, 2015.\n\n\n== History ==\n\n\n=== DistBelief ===\nStarting in 2011, Google Brain built DistBelief as a proprietary machine learning system based on deep learning neural networks. Its use grew rapidly across diverse Alphabet companies in both research and commercial applications. Google assigned multiple computer scientists, including Jeff Dean, to simplify and refactor the codebase of DistBelief into a faster, more robust application-grade library, which became TensorFlow. In 2009, the team, led by Geoffrey Hinton, had implemented generalized backpropagation and other improvements which allowed generation of neural networks with substantially higher accuracy, for instance a 25% reduction in errors in speech recognition.\n\n\n=== TensorFlow ===\nTensorFlow is Google Brain's second-generation system. Version 1.0.0 was released on February 11, 2017.  While the reference implementation runs on single devices, TensorFlow can run on multiple CPUs and GPUs (with optional CUDA and SYCL extensions for general-purpose computing on graphics processing units). TensorFlow is available on 64-bit Linux, macOS, Windows, and mobile computing platforms including Android and iOS.\nIts flexible architecture allows for the easy deployment of computation across a variety of platforms (CPUs, GPUs, TPUs), and from desktops to clusters of servers to mobile and edge devices.\nTensorFlow computations are expressed as stateful dataflow graphs. The name TensorFlow derives from the operations that such neural networks perform on multidimensional data arrays, which are referred to as tensors. During the Google I/O Conference in June 2016, Jeff Dean stated that 1,500 repositories on GitHub mentioned TensorFlow, of which only 5 were from Google.In March 2018, Google announced TensorFlow.js version 1.0 for machine learning in JavaScript and TensorFlow Graphics for deep learning in computer graphics.In Jan 2019, Google announced TensorFlow 2.0. It became officially available in Sep 2019.\n\n\n=== Tensor processing unit (TPU) ===\nIn May 2016, Google announced its Tensor processing unit (TPU), an application-specific integrated circuit (a hardware chip) built specifically for machine learning and tailored for TensorFlow. TPU is a programmable AI accelerator designed to provide high throughput of low-precision arithmetic (e.g., 8-bit), and oriented toward using or running models rather than training them. Google announced they had been running TPUs inside their data centers for more than a year, and had found them to deliver an order of magnitude better-optimized performance per watt for machine learning.In May 2017, Google announced the second-generation, as well as the availability of the TPUs in Google Compute Engine. The second-generation TPUs deliver up to 180 teraflops of performance, and when organized into clusters of 64 TPUs, provide up to 11.5 petaflops.\nIn May 2018, Google announced the third-generation TPUs delivering up to 420 teraflops of performance and 128 GB HBM. Cloud TPU v3 Pods offer 100+ petaflops of performance and 32 TB HBM.In February 2018, Google announced that they were making TPUs available in beta on the Google Cloud Platform.\n\n\n=== Edge TPU ===\nIn July 2018, the Edge TPU was announced. Edge TPU is Google\u2019s purpose-built ASIC chip designed to run TensorFlow Lite machine learning (ML) models on small client computing devices such as smartphones known as edge computing.\n\n\n=== TensorFlow Lite ===\nIn May 2017, Google announced a software stack specifically for mobile development, TensorFlow Lite.  In January 2019, TensorFlow team released a developer preview of the mobile GPU inference engine with OpenGL ES 3.1 Compute Shaders on Android devices and Metal Compute Shaders on iOS devices. In May 2019, Google announced that their TensorFlow Lite Micro (also known as TensorFlow Lite for Microcontrollers) and ARM's uTensor would be merging.\n\n\n=== Pixel Visual Core (PVC) ===\nIn October 2017, Google released the Google Pixel 2 which featured their Pixel Visual Core (PVC), a fully programmable image, vision and AI processor for mobile devices. The PVC supports TensorFlow for machine learning (and Halide for image processing).\n\n\n=== Applications ===\nGoogle officially released RankBrain on October 26, 2015, backed by TensorFlow.\nGoogle also released Colaboratory, which is a TensorFlow Jupyter notebook environment that requires no setup to use.\n\n\n=== Machine Learning Crash Course (MLCC) ===\nOn March 1, 2018, Google released its Machine Learning Crash Course (MLCC). Originally designed to help equip Google employees with practical artificial intelligence and machine learning fundamentals, Google rolled out its free TensorFlow workshops in several cities around the world before finally releasing the course to the public.\n\n\n== Features ==\nTensorFlow provides stable Python (for version 3.7 across all platforms) and C APIs; and without API backwards compatibility guarantee: C++, Go, Java, JavaScript and Swift (early release).  Third-party packages are available for C#, Haskell, Julia, R, Scala, Rust, OCaml, and Crystal.\"New language support should be built on top of the C API. However, [..] not all functionality is available in C yet.\" Some more functionality is provided by the Python API.\n\n\n== Applications ==\n\nAmong the applications for which TensorFlow is the foundation, are automated image-captioning software, such as DeepDream. RankBrain now handles a substantial number of search queries, replacing and supplementing traditional static algorithm-based search results.\n\n\n== See also ==\nComparison of deep learning software\nDifferentiable programming\n\n\n== References ==\n\n\n== Bibliography ==\n\n\n== External links ==\nOfficial website"}, {"job": "data scientist", "skill": "data management", "keywords": "data management", "description": "Data Management comprises all disciplines related to managing data as a valuable resource.\n\n\n== Concept ==\nThe concept of data management arose in the 1980s as technology moved from sequential processing (first punched cards, then magnetic tape) to random access storage.  Since it was now possible to store a discrete fact and quickly access it using random access disk technology, those suggesting that data management was more important than business process management used arguments such as \"a customer's home address is stored in 75 (or some other large number) places in our computer systems.\"  However, during this period, random access processing was not competitively fast, so those suggesting \"process management\" was more important than \"data management\" used batch processing time as their primary argument. As application software evolved into real-time, interactive usage, it became obvious that both management processes were important.  If the data was not well defined, the data would be mis-used in applications. If the process wasn't well defined, it was impossible to meet user needs.\n\n\n== Topics ==\nTopics in data management include:\n\n\n== Usage ==\nIn modern management usage, the term data is increasingly replaced by information or even knowledge in a non-technical context. Thus data management has become information management or knowledge management. This trend obscures the raw data processing and renders interpretation implicit. The distinction between data and derived value is illustrated by the information ladder.\nHowever, data has staged a comeback with the popularisation of the term Big data, which refers to the collection and analyses of massive sets of data.\nSeveral organisations have established data management centers (DMC)\nfor their operations.\n\n\n== Integrated data management ==\nIntegrated data management (IDM) is a tools approach to facilitate data management and improve performance. IDM consists of an integrated, modular environment to manage enterprise application data, and optimize data-driven applications over its lifetime. IDM's purpose is to:\n\nProduce enterprise-ready applications faster\nImprove data access, speed iterative testing\nEmpower collaboration between architects, developers and DBAs\nConsistently achieve service level targets\nAutomate and simplify operations\nProvide contextual intelligence across the solution stack\nSupport business growth\nAccommodate new initiatives without expanding infrastructure\nSimplify application upgrades, consolidation and retirement\nFacilitate alignment, consistency and governance\nDefine business policies and standards up front;  share, extend, and apply throughout the lifecycle\n\n\n== See also ==\n\n\n== References ==\n\n\n== External links ==\n Media related to Data management at Wikimedia Commons\nData management at Curlie"}, {"job": "data scientist", "skill": "apache spark", "keywords": "apache spark", "description": "Apache Spark is an open-source distributed general-purpose cluster-computing framework. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.\n\n\n== Overview ==\nApache Spark has as its architectural foundation the Resilient Distributed Dataset (RDD), a read-only multiset of data items distributed over a cluster of machines, that is maintained in a fault-tolerant way. The Dataframe API was released as an abstraction on top of the RDD, followed by the Dataset API. In Spark 1.x, the RDD was the primary application programming interface (API), but as of Spark 2.x use of the Dataset API is encouraged even though the RDD API is not deprecated. The RDD technology still underlies the Dataset API.Spark and its RDDs were developed in 2012 in response to limitations in the MapReduce cluster computing paradigm, which forces a particular linear dataflow structure on distributed programs: MapReduce programs read input data from disk, map a function across the data, reduce the results of the map, and store reduction results on disk. Spark's RDDs function as a working set for distributed programs that offers a (deliberately) restricted form of distributed shared memory.Spark facilitates the implementation of both iterative algorithms, which visit their data set multiple times in a loop, and interactive/exploratory data analysis, i.e., the repeated database-style querying of data. The latency of such applications may be reduced by several orders of magnitude compared to Apache Hadoop MapReduce implementation.\nAmong the class of iterative algorithms are the training algorithms for machine learning systems, which formed the initial impetus for developing Apache Spark.Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports standalone (native Spark cluster,  where you can launch a cluster either manually or use the launch scripts provided by the install package. It is also possible to run these daemons on a single machine for testing), Hadoop YARN, Apache Mesos or Kubernetes.  For distributed storage, Spark can interface with a wide variety, including Alluxio, Hadoop Distributed File System (HDFS), MapR File System (MapR-FS), Cassandra, OpenStack Swift, Amazon S3, Kudu, Lustre file system, or a custom solution can be implemented. Spark also supports a pseudo-distributed local mode, usually used only for development or testing purposes, where distributed storage is not required and the local file system can be used instead; in such a scenario, Spark is run on a single machine with one executor per CPU core.\n\n\n=== Spark Core ===\nSpark Core is the foundation of the overall project. It provides distributed task dispatching, scheduling, and basic I/O functionalities, exposed through an application programming interface (for Java, Python, Scala, and R) centered on the RDD abstraction (the Java API is available for other JVM languages, but is also usable for some other non-JVM languages that can connect to the JVM, such as Julia). This interface mirrors a functional/higher-order model of programming: a \"driver\" program invokes parallel operations such as map, filter or reduce on an RDD by passing a function to Spark, which then schedules the function's execution in parallel on the cluster. These operations, and additional ones such as joins, take RDDs as input and produce new RDDs. RDDs are immutable and their operations are lazy; fault-tolerance is achieved by keeping track of the \"lineage\" of each RDD (the sequence of operations that produced it) so that it can be reconstructed in the case of data loss. RDDs can contain any type of Python, Java, or Scala objects.\nBesides the RDD-oriented functional style of programming, Spark provides two restricted forms of shared variables: broadcast variables reference read-only data that needs to be available on all nodes, while accumulators can be used to program reductions in an imperative style.A typical example of RDD-centric functional programming is the following Scala program that computes the frequencies of all words occurring in a set of text files and prints the most common ones. Each map, flatMap (a variant of map) and reduceByKey takes an anonymous function that performs a simple operation on a single data item (or a pair of items), and applies its argument to transform an RDD into a new RDD.\n\n\n=== Spark SQL ===\nSpark SQL is a component on top of Spark Core that introduced a data abstraction called DataFrames, which provides support for structured and semi-structured data. Spark SQL provides a domain-specific language (DSL) to manipulate DataFrames in Scala, Java, or Python. It also provides SQL language support, with command-line interfaces and ODBC/JDBC server. Although DataFrames lack the compile-time type-checking afforded by RDDs, as of Spark 2.0, the strongly typed DataSet is fully supported by Spark SQL as well.\n\n\n=== Spark Streaming ===\nSpark Streaming uses Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD transformations on those mini-batches of data. This design enables the same set of application code written for batch analytics to be used in streaming analytics, thus facilitating easy implementation of lambda architecture. However, this convenience comes with the penalty of latency equal to the mini-batch duration. Other streaming data engines that process event by event rather than in mini-batches include Storm and the streaming component of Flink. Spark Streaming has support built-in to consume from Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP/IP sockets.In Spark 2.x, a separate technology based on Datasets, called Structured Streaming, that has a higher-level interface is also provided to support streaming.Spark can be deployed in a traditional on-premises data center as well as in the cloud.\n\n\n=== MLlib Machine Learning Library ===\nSpark MLlib is a distributed machine-learning framework on top of Spark Core that, due in large part to the distributed memory-based Spark architecture, is as much as nine times as fast as the disk-based implementation used by Apache Mahout (according to benchmarks done by the MLlib developers against the alternating least squares (ALS) implementations, and before Mahout itself gained a Spark interface), and scales better than Vowpal Wabbit. An overview of Spark MLlib exists. Many common machine learning and statistical algorithms have been implemented and are shipped with MLlib which simplifies large scale machine learning pipelines, including:\n\nsummary statistics, correlations, stratified sampling, hypothesis testing, random data generation\nclassification and regression: support vector machines, logistic regression, linear regression, decision trees, naive Bayes classification, Decision Tree, Random Forest, Gradient-Boosted Tree\ncollaborative filtering techniques including alternating least squares (ALS)\ncluster analysis methods including k-means, and latent Dirichlet allocation (LDA)\ndimensionality reduction techniques such as singular value decomposition (SVD), and principal component analysis (PCA)\nfeature extraction and transformation functions\noptimization algorithms such as stochastic gradient descent, limited-memory BFGS (L-BFGS)\n\n\n=== GraphX ===\nGraphX is a distributed graph-processing framework on top of Apache Spark. Because it is based on RDDs, which are immutable, graphs are immutable and thus GraphX is unsuitable for graphs that need to be updated, let alone in a transactional manner like a graph database. GraphX provides two separate APIs for implementation of massively parallel algorithms (such as PageRank): a Pregel abstraction, and a more general MapReduce-style API. Unlike its predecessor Bagel, which was formally deprecated in Spark 1.6, GraphX has full support for property graphs (graphs where properties can be attached to edges and vertices).GraphX can be viewed as being the Spark in-memory version of Apache Giraph, which utilized Hadoop disk-based MapReduce.Like Apache Spark, GraphX initially started as a research project at UC Berkeley's AMPLab and Databricks, and was later donated to the Apache Software Foundation and the Spark project.\n\n\n== History ==\nSpark was initially started by Matei Zaharia at UC Berkeley's AMPLab in 2009, and open sourced in 2010 under a BSD license.In 2013, the project was donated to the Apache Software Foundation and switched its license to Apache 2.0. In February 2014, Spark became a Top-Level Apache Project.In November 2014, Spark founder M. Zaharia's company Databricks set a new world record in large scale sorting using Spark.Spark had in excess of 1000 contributors in 2015, making it one of the most active projects in the Apache Software Foundation and one of the most active open source big data projects.\n\n\n== See also ==\nList of concurrent and parallel programming APIs/Frameworks\n\n\n== Notes ==\n\n\n== References ==\n\n\n== External links ==\nOfficial website \nLearn Spark By Examples and Simple Github Projects www.sparkbyexamples.com"}, {"job": "data scientist", "skill": "data cleaning", "keywords": "data cleaning", "description": "Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data. Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting.\nAfter cleansing, a data set should be consistent with other similar data sets in the system. The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores. Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data.\n\nThe actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid postal code) or fuzzy (such as correcting records that partially match existing, known records). Some data cleansing solutions will clean data by cross-checking with a validated data set. A common data cleansing practice is data enhancement, where data is made more complete by adding related information. For example, appending addresses with any phone numbers related to that address. Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing together data of \"varying file formats, naming conventions, and columns\", and transforming it into one cohesive data set; a simple example is the expansion of abbreviations (\"st, rd, etc.\" to \"street, road, etcetera\").\n\n\n== Motivation ==\nAdministratively, incorrect or inconsistent data can lead to false conclusions and misdirect investments on both public and private scales. For instance, the government may want to analyze population census figures to decide which regions require further spending and investment on infrastructure and services. In this case, it will be important to have access to reliable data to avoid erroneous fiscal decisions. In the business world, incorrect data can be costly. Many companies use customer information databases that record data like contact information, addresses, and preferences. For instance, if the addresses are inconsistent, the company will suffer the cost of resending mail or even losing customers. The profession of forensic accounting and fraud investigating uses data cleansing in preparing its data and is typically done before data is sent to a data warehouse for further investigation. There are packages available so you can cleanse/wash address data while you enter it into your system. This is normally done via an application programming interface (API).\n\n\n== Data quality ==\nHigh-quality data needs to pass a set of quality criteria. Those include:\n\nValidity: The degree to which the measures conform to defined business rules or constraints (see also Validity (statistics)). When modern database technology is used to design data-capture systems, validity is fairly easy to ensure: invalid data arises mainly in legacy contexts (where constraints were not implemented in software) or where inappropriate data-capture technology was used (e.g., spreadsheets, where it is very hard to limit what a user chooses to enter into a cell, if cell validation is not used). Data constraints fall into the following categories:\nData-Type Constraints \u2013 e.g., values in a particular column must be of a particular data type, e.g., Boolean, numeric (integer or real), date, etc.\nRange Constraints: typically, numbers or dates should fall within a certain range. That is, they have minimum and/or maximum permissible values.\nMandatory Constraints: Certain columns cannot be empty.\nUnique Constraints: A field, or a combination of fields, must be unique across a dataset. For example, no two persons can have the same social security number.\nSet-Membership constraints: The values for a column come from a set of discrete values or codes. For example, a person's gender may be Female, Male or Unknown (not recorded).\nForeign-key constraints: This is the more general case of set membership. The set of values in a column is defined in a column of another table that contains unique values. For example, in a US taxpayer database, the \"state\" column is required to belong to one of the US's defined states or territories: the set of permissible states/territories is recorded in a separate State table. The term foreign key is borrowed from relational database terminology.\nRegular expression patterns: Occasionally, text fields will have to be validated this way. For example, phone numbers may be required to have the pattern (999) 999-9999.\nCross-field validation: Certain conditions that utilize multiple fields must hold. For example, in laboratory medicine, the sum of the components of the differential white blood cell count must be equal to 100 (since they are all percentages). In a hospital database, a patient's date of discharge from the hospital cannot be earlier than the date of admission.\nAccuracy: The degree of conformity of a measure to a standard or a true value - see also Accuracy and precision. Accuracy is very hard to achieve through data-cleansing in the general case because it requires accessing an external source of data that contains the true value: such \"gold standard\" data is often unavailable. Accuracy has been achieved in some cleansing contexts, notably customer contact data, by using external databases that match up zip codes to geographical locations (city and state) and also help verify that street addresses within these zip codes actually exist.\nCompleteness: The degree to which all required measures are known. Incompleteness is almost impossible to fix with data cleansing methodology: one cannot infer facts that were not captured when the data in question was initially recorded. (In some contexts, e.g., interview data, it may be possible to fix incompleteness by going back to the original source of data, i,e., re-interviewing the subject, but even this does not guarantee success because of problems of recall - e.g., in an interview to gather data on food consumption, no one is likely to remember exactly what one ate six months ago. In the case of systems that insist certain columns should not be empty, one may work around the problem by designating a value that indicates \"unknown\" or \"missing\", but the supplying of default values does not imply that the data has been made complete.\nConsistency: The degree to which a set of measures are equivalent in across systems (see also Consistency). Inconsistency occurs when two data items in the data set contradict each other: e.g., a customer is recorded in two different systems as having two different current addresses, and only one of them can be correct. Fixing inconsistency is not always possible: it requires a variety of strategies - e.g., deciding which data were recorded more recently, which data source is likely to be most reliable (the latter knowledge may be specific to a given organization), or simply trying to find the truth by testing both data items (e.g., calling up the customer).\nUniformity: The degree to which a set data measures are specified using the same units of measure in all systems ( see also Unit of measure). In datasets pooled from different locales, weight may be recorded either in pounds or kilos and must be converted to a single measure using an arithmetic transformation.The term integrity encompasses accuracy, consistency and some aspects of validation (see also data integrity) but is rarely used by itself in data-cleansing contexts because it is insufficiently specific. (For example, \"referential integrity\" is a term used to refer to the enforcement of foreign-key constraints above.)\n\n\n== Process ==\nData auditing: The data is audited with the use of statistical and database methods to detect anomalies and contradictions: this eventually indicates the characteristics of the anomalies and their locations. Several commercial software packages will let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violation of these constraints. This process is referred to below in the bullets \"workflow specification\" and \"workflow execution.\" For users who lack access to high-end cleansing software, Microcomputer database packages such as Microsoft Access or File Maker Pro will also let you perform such checks, on a constraint-by-constraint basis, interactively with little or no programming required in many cases.\nWorkflow specification: The detection and removal of anomalies are performed by a sequence of operations on the data known as the workflow. It is specified after the process of auditing the data and is crucial in achieving the end product of high-quality data. In order to achieve a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.\nWorkflow execution: In this stage, the workflow is executed after its specification is complete and its correctness is verified. The implementation of the workflow should be efficient, even on large sets of data, which inevitably poses a trade-off because the execution of a data-cleansing operation can be computationally expensive.\nPost-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during the execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.Good quality source data has to do with \u201cData Quality Culture\u201d and must be initiated at the top of the organization. It is not just a matter of implementing strong validation checks on input screens, because almost no matter how strong these checks are, they can often still be circumvented by the users. There is a nine-step guide for organizations that wish to improve data quality:\nDeclare a high-level commitment to a data quality culture\nDrive process reengineering at the executive level\nSpend money to improve the data entry environment\nSpend money to improve application integration\nSpend money to change how processes work\nPromote end-to-end team awareness\nPromote interdepartmental cooperation\nPublicly celebrate data quality excellence\nContinuously measure and improve data qualityOthers include:\n\nParsing:  for the detection of syntax errors. A parser decides whether a string of data is acceptable within the allowed data specification. This is similar to the way a parser works with grammars and languages.\nData transformation: Data transformation allows the mapping of the data from its given format into the format expected by the appropriate application. This includes value conversions or translation functions, as well as normalizing numeric values to conform to minimum and maximum values.\nDuplicate elimination: Duplicate detection requires an algorithm for determining whether data contains duplicate representations of the same entity. Usually, data is sorted by a key that would bring duplicate entries closer together for faster identification.\nStatistical methods: By analyzing the data using the values of mean, standard deviation, range, or clustering algorithms, it is possible for an expert to find values that are unexpected and thus erroneous. Although the correction of such data is difficult since the true value is not known, it can be resolved by setting the values to an average or other statistical value. Statistical methods can also be used to handle missing values which can be replaced by one or more plausible values, which are usually obtained by extensive data augmentation algorithms.\n\n\n== System ==\nThe essential job of this system is to find a suitable balance between fixing dirty data and maintaining the data as close as possible to the original data from the source production system. This is a challenge for the Extract, transform, load architect. The system should offer an architecture that can cleanse data, record quality events and measure/control quality of data in the data warehouse. A good start is to perform a thorough data profiling analysis that will help define to the required complexity of the data cleansing system and also give an idea of the current data quality in the source system(s).\n\n\n== Tools ==\nThere are lots of data cleansing tools like Trifacta, OpenRefine, Paxata, Alteryx, Data Ladder, WinPure and others. It's also common to use libraries like Pandas (software) for Python (programming language), or Dplyr for R (programming language). \nOne example of a data cleansing for distributed systems under Apache Spark is called Optimus, an OpenSource framework for laptop or cluster allowing pre-processing, cleansing, and exploratory data analysis.  It includes several data wrangling tools.\n\n\n== Quality screens ==\nPart of the data cleansing system is a set of diagnostic filters known as quality screens. They each implement a test in the data flow that, if it fails, records an error in the Error Event Schema. Quality screens are divided into three categories:\n\nColumn screens. Testing the individual column, e.g. for unexpected values like NULL values; non-numeric values that should be numeric; out of range values; etc.\nStructure screens. These are used to test for the integrity of different relationships between columns (typically foreign/primary keys) in the same or different tables. They are also used for testing that a group of columns is valid according to some structural definition to which it should adhere.\nBusiness rule screens. The most complex of the three tests. They test to see if data, maybe across multiple tables, follow specific business rules. An example could be, that if a customer is marked as a certain type of customer, the business rules that define this kind of customer should be adhered to.When a quality screen records an error, it can either stop the dataflow process, send the faulty data somewhere else than the target system or tag the data.\nThe latter option is considered the best solution because the first option requires, that someone has to manually deal with the issue each time it occurs and the second implies that data are missing from the target system (integrity) and it is often unclear what should happen to these data.\n\n\n== Criticism of existing tools and processes ==\nMost data cleansing tools have limitations in usability:\n\nProject costs: costs typically in the hundreds of thousands of dollars\nTime: mastering large-scale data-cleansing software is time-consuming\nSecurity: cross-validation requires sharing information, giving an application access across systems, including sensitive legacy systems\n\n\n== Error event schema ==\nThe Error Event schema holds records of all error events thrown by the quality screens. It consists of an Error Event Fact table with foreign keys to three dimension tables that represent date (when), batch job (where) and screen (who produced error). It also holds information about exactly when the error occurred and the severity of the error. Also, there is an Error Event Detail Fact table with a foreign key to the main table that contains detailed information about in which table, record and field the error occurred and the error condition.\n\n\n== See also ==\nData editing\nData mining\nRecord linkage\nSingle customer view\n\n\n== References ==\n\n\n== Sources ==\nHan, J., Kamber, M.  Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. ISBN 1-55860-489-8.\nKimball, R., Caserta, J. The Data Warehouse ETL Toolkit, Wiley and Sons, 2004. ISBN 0-7645-6757-8.\nMuller H., Freytag J., Problems, Methods, and Challenges in Comprehensive Data Cleansing, Humboldt-Universitat zu Berlin, Germany, 2003.\nRahm, E., Hong, H. Data Cleaning: Problems and Current Approaches, University of Leipzig, Germany, 2000.\n\n\n== External links ==\nComputerworld: Data Scrubbing (February 10, 2003)\nErhard Rahm, Hong Hai Do: Data Cleaning: Problems and Current Approaches"}]