Thoughts on data science

Data science is one of the magic disciplines of this century. It is all about deriving insights from data. Data can be observed everywhere, but to gain insights, the observations need to be either

  • explored or 
  • described in a well-understood form, e.g. equations.  

Exploration is made via 

  • probability or
  • statistics. 

Probability assesses a single observed event; statistics assesses a large collection of observed events. 

In comparison, descriptions of an observation are made via 

  • calculation rules, e.g. equations such as f(x) = y, or
  • structures, e.g. lines and spaces built from vectors and matrices.  

Calculation rules are e.g. functions; structures are e.g. points (0-dimensional), vectors (1-dimensional), matrices (2-dimensional) or tensors (> 2-dimensional). 

One observation is also called a data point or actual value. 

For exploration as well as for understanding, a fundamental question is that of identity: how equal are two compared entities? The mathematical domain behind this is called measure theory. It is all about descriptive metrics, e.g.

  • the length/ distance between two points in a given space,
  • the volume of a given multidimensional space or,
  • the area of a given space.  

Components

All approaches share the same components:

  • The objective, which relates to data, e.g. sales orders,
  • Data, which raises questions, e.g. how to forecast sales orders, and which 
    • has observed properties (features, dimensions),
    • has noise,
    • has invariants (aspects that stay constant),
  • Model (aka hypothesis space) as the answer to the given questions, e.g. a sales forecast, which
    • requires training or applied, generalized rules (e.g. Euclidean distance) and
    • uses an algorithm to generalize the answer, with an
      • Optimizer (e.g. quality metric, proportion metric, relationship metric) to improve the model performance while learning, following either
        • the cost approach, a decrease of error (the distance to the objective), or
        • the objective approach, an increase of identity with an objective. 

Commonly known metrics are summarized in the so-called confusion matrix.

  • Evaluator (aka loss, cross-entropy, empirical loss, error function, empirical risk, target function) to check the quality while learning and
  • Estimator to do the prediction during application. 

Turning data properties into (valuable) information happens via so-called feature engineering. Today this consumes 60%-80% of most projects. 

Types Of Data 

It is important to understand the types of data values. This becomes obvious when preparing the data for use, often called encoding. Data can be categorized in four ways:

  • Nominal – you name it,
  • Ordinal – you order it, 
  • Interval – you can add and subtract, e.g. dates (“always in January”), and
  • Ratio – you can multiply and divide, e.g. say “many times as much”.  

The kind of data gives the first guidance on whether to

  • rather explore the data or 
  • rather describe the understanding of it. 

The differences between the four are:

| Property            | Nominal | Ordinal | Interval | Ratio |
|---------------------|---------|---------|----------|-------|
| Countable           | Yes     | Yes     | Yes      | Yes   |
| Ordered             |         | Yes     | Yes      | Yes   |
| + and – possible    |         |         | Yes      | Yes   |
| * and / possible    |         |         |          | Yes   |
| Mode                | Yes     | Yes     | Yes      | Yes   |
| Median              |         | Yes     | Yes      | Yes   |
| Mean                |         |         | Yes      | Yes   |
| Min/max             |         | Yes     | Yes      | Yes   |
| Percentile possible |         | Yes     | Yes      | Yes   |
| Variance possible   |         |         | Yes      | Yes   |

Types of model

Following Kevin Binz's classification:

  • Connectionist model, e.g. neural networks as connections between function terms.
  • Analogical model, e.g. extending or reducing the considered dimensions; simplified, a function gets more or fewer variables. 
  • Probabilistic model, giving the probability that specific events appear. 
  • Symbolist model, including everything that depends on rules (singular or in a tree). 
  • Last but not least, the natural way of the evolutionary model is to adapt or fail. Evolutionary approaches are often modeled via a so-called genetic algorithm. You find different ways to explain the approach, e.g. via “survival of the fittest”.

All approaches have 5 steps in common:

  • Take an initial population
  • Determine fitness – in IT we require a mathematical definition of the fitness function 
  • Select parents – the two fittest ones 
  • Make a crossover 
  • Let the mutation happen    

The example I always remember is the “banana evolution”. I do not know who invented this example, but if you are the one, just contact me and I will mention you. 

Banana evolution: Find the word “banana”

  • Given a population of words “bhihsihis”, “banoiidii”, …., “ugguana”
  • Take “banoiidii” and “ugguana” 
  • Make a crossover e.g. “ban…” with “…ana”
  • Let the mutation happen: “banana”. 

This approach is so powerful that it created us as a human species, and it also powers approaches like AutoML.
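
A minimal sketch of the banana evolution as a genetic algorithm, following the 5 steps above. The population size, mutation rate and helper functions are illustrative assumptions, not a fixed recipe.

```python
# Evolve random words towards the target word "banana".
import random
import string

TARGET = "banana"
ALPHABET = string.ascii_lowercase
POPULATION_SIZE = 20

def fitness(word):
    # Count how many positions already match the target word.
    return sum(1 for a, b in zip(word, TARGET) if a == b)

def crossover(parent_a, parent_b):
    # Take the first half from one parent and the second half from the other.
    cut = len(TARGET) // 2
    return parent_a[:cut] + parent_b[cut:]

def mutate(word, rate=0.2):
    # Randomly replace characters with a small probability.
    return "".join(random.choice(ALPHABET) if random.random() < rate else c for c in word)

# 1. Take an initial population of random words
population = ["".join(random.choices(ALPHABET, k=len(TARGET))) for _ in range(POPULATION_SIZE)]

for generation in range(1000):
    # 2. Determine fitness and 3. select the two fittest parents
    population.sort(key=fitness, reverse=True)
    if population[0] == TARGET:
        print(f"Found '{TARGET}' in generation {generation}")
        break
    parent_a, parent_b = population[0], population[1]
    # 4. Make a crossover and 5. let the mutation happen (keep the best parent unchanged)
    population = [parent_a] + [
        mutate(crossover(parent_a, parent_b)) for _ in range(POPULATION_SIZE - 1)
    ]
```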

Types of learning

 Let us start with an example. Both of the following lines are observations of the same information. 

In the first line, we as humans have learned (experience) to interpret the given images; in the second we have not. The second is only a bunch of data, not usable information (8 bits, the first 3 activated). 

When machines learn, they find a generalization for a specific task. The generalization is also called a model. The difference between learning and intelligence is what intelligence implies: intelligence implies raising new questions. (Machine) learning only answers previously defined questions – the objective. 

Due to that, let us remember machine learning as a toolset for finding an optimal model for a specific purpose. The important word is “specific”. 

Three primary kinds of learning approaches can be distinguished:

  • Supervised learning (task-driven) learns from examples of the expected output. The objective of supervised learning is to find the best mapping from given inputs to expected, known outputs (aka labels). 
  • Unsupervised learning (data-driven) discovers given patterns/structures via generalized, mathematical rules.
  • Reinforcement learning (environment-driven) is based on a feedback loop (aka interaction) with an environment that provides reward or punishment. The agent tries to collect as much reward as possible. 

One specific instance of such an interaction is called a task. Tasks are distinguished into the following two groups:

  • Episodic: The task has a clear start and end state. 
  • Continuous: The task has no terminating state – staying alive is the underlying objective.

The reward system is distinguished into: 

  • Monte Carlo: The reward is collected at the end of a task.
  • Temporal Difference (TD): The reward is earned at each step of a task. 

Reinforcement learning has one fundamental tradeoff – exploitation vs. exploration. In other words, performing an action which directly results in a reward or exploring the environment to gain a reward later on. This is the reason why we call it reinforcement learning. To balance the tradeoff, a so-called reinforcement function is introduced. Four variants are common:

  • Value-based: Determine the maximum expected (future) reward for each state. The most common approach is Q-learning, a kind of value-based learning. A cross-table of actions (columns) and states (rows) is built; each cell contains the expected (future) reward. It is like a transition map, somewhat comparable to a Markov probability chain, with the probability replaced by the reward expectation – the table is like a cheat sheet. Due to that, the cell contains the quality (hence Q-learning) of the next move; see the sketch after this list. 
  • Policy-based: Determine the next action directly. 
  • Error-based: Determine the loss of the next feasible action. 
  • Model-based: Reflect/learn the behavior of the environment, to know which actions result in the maximum reward.
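
A minimal sketch of the Q-learning table update described in the value-based bullet. The tiny environment (3 states, 2 actions) and the learning parameters are illustrative assumptions.

```python
import numpy as np

n_states, n_actions = 3, 2
Q = np.zeros((n_states, n_actions))      # rows = states, columns = actions ("cheat sheet")

alpha, gamma = 0.1, 0.9                  # learning rate and discount factor

def update(state, action, reward, next_state):
    # Classic Q-learning update: move Q(s, a) towards the observed reward
    # plus the discounted best expected future reward of the next state.
    best_future = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_future - Q[state, action])

# Example transition: in state 0, action 1 yields reward 1 and leads to state 2.
update(state=0, action=1, reward=1.0, next_state=2)
print(Q)
```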

Common use-cases are: 

| Category               | Target                                                             | Use case                |
|------------------------|--------------------------------------------------------------------|-------------------------|
| Supervised learning    | Regression – target variable known and continuous                  | Predict house prices    |
| Supervised learning    | Classification – target variable known and categorical             | Visual object detection |
| Unsupervised learning  | Clustering – target variable unknown, but intentions are known     | Customer segmentation   |
| Unsupervised learning  | Association – target variable unknown, but intentions are known    | Basket analysis         |
| Reinforcement learning | Classification – target variable known and categorical             | Optimized marketing     |
| Reinforcement learning | Control – target variable unknown, but intention can be described  | Driverless cars         |

Summarized, learning in the sense of machine learning is only applied mathematics (aka algorithms), e.g. optimizing the found generalization. 

Optimization dilemma

Optimizing a model requires an understanding of its performance. A commonly expressed dilemma is the tradeoff between 

  • maximum accuracy (or minimal error) and 
  • model interpretability

A more flexible model is harder to interpret, e.g. a neural network. But a poorly interpretable model tends to reach higher precision (for a specific purpose). The reason for that is described via 

  • information theory and 
  • the law of large numbers. 

Information theory asks what the information content of a given observation is. The answer depends on statistics, i.e. the distribution of the value instances. If, for example, the observed value is 3 and most other observations are also around 3 (the mean), then the information content (surprise) of that observation is low; rare values carry more information. Why is this important? Well, more data (law of large numbers) could lead to more value instances being considered. 
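
A small sketch of this idea via self-information: frequent values carry little information, rare values carry more. The sample observations are made-up numbers.

```python
import math
from collections import Counter

observations = [3, 3, 3, 3, 3, 3, 3, 7, 3, 3]
counts = Counter(observations)
total = len(observations)

for value, count in counts.items():
    p = count / total
    info = -math.log2(p)                 # self-information in bits
    print(f"value={value} p={p:.2f} information={info:.2f} bits")
```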

Sub-clauses: 

  • This is why big data makes sense in some cases. 
  • Machine learning datasets and the related learning are influenced by the underlying distribution and by how much of all possible variants (value instances) is contained in the training set. 
  • If the considered value distribution is narrowed, the measured model accuracy increases. This is often called a statistical lie.

In the course of this dilemma the term “game theory” is often applied. Game theory is about choosing a strategy – here, for optimization. Summarized, it is an ongoing comparison of populations vs. outcomes of applied strategies, e.g. result of strategy A = population X percentage * respective outcome + population Y percentage * respective outcome + …
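
A tiny sketch of that expected-outcome comparison; the population shares and outcomes are made-up illustrative numbers.

```python
# result of a strategy = sum over populations of (population share * outcome)
strategy_a = [(0.6, 10), (0.4, -5)]      # (population share, outcome) pairs
strategy_b = [(0.3, 20), (0.7, 0)]

expected = lambda strategy: sum(share * outcome for share, outcome in strategy)
print("strategy A:", expected(strategy_a))   # 0.6*10 + 0.4*(-5) = 4.0
print("strategy B:", expected(strategy_b))   # 0.3*20 + 0.7*0   = 6.0
```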

Optimization approaches

The following approaches support optimization: 

  • Tuning helps to either find a good starting point and/or avoid that the model performance decreases (too fast), e.g. bootstrapping. Bootstrapping comes into play during data preparation. The procedure is to repeatedly take random samples, with replacement, out of the available data (points). 

The mathematical foundation is given by the Central Limit Theorem (CLT), which implies that the means of the resamples concentrate around the true mean. Also compare the t-test, mentioned in one of the previous chapters. 
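
A minimal bootstrapping sketch: resample the data with replacement many times and look at the distribution of the sample means. The data and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=100)   # original observations

bootstrap_means = [
    rng.choice(data, size=len(data), replace=True).mean()
    for _ in range(1000)
]

print("mean of bootstrap means:", np.mean(bootstrap_means))
print("standard error estimate:", np.std(bootstrap_means))
```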

  • Filters help to reduce the amount of considered data. 
    • Distance (norm) approaches such as the Euclidean distance, the aforementioned correlation, linear discriminant analysis (LDA) or the Chi-Square test. 
    • Statistical approaches such as Kalman and Bayes filters.
    • Space approaches such as dimensional reduction, e.g. via Principal Component Analysis (PCA). 
  • Regularization, e.g. L1, L2 or dropout of already trained parts, as well as the early-stopping (policy) while training. 
  • Ensembling as a method applied while training. It is about building subsets of data and models. Two major approaches exist:
    • Boosting, such as decision-tree algorithms (applied if-then cases) like AdaBoost, puts weak learning models in sequential order and builds one combined, strong model. The approach decreases the bias. 
    • Bagging or “bootstrap aggregation”, such as the Random Forest algorithm, puts multiple weak learning models in parallel and combines the results, e.g. takes multiple decision trees into consideration. Bagging takes the trees known from boosting and puts them together into a (random) forest; it mainly decreases variance. Check previous chapters. 
    • To improve during execution, stacking supports. Stacking combines multiple base models via a weighted majority, e.g.
      • a voting model (classification) or 
      • a weighted-sum model (regression). 

The combining model is often called the meta-model.
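
A hedged sketch of the three ensembling flavours using scikit-learn (assumed to be available): boosting (AdaBoost), bagging (Random Forest) and a voting meta-model over base models. The dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

boosting = AdaBoostClassifier(n_estimators=50, random_state=0)      # sequential weak learners
bagging = RandomForestClassifier(n_estimators=50, random_state=0)   # parallel trees, averaged
voting = VotingClassifier(                                          # weighted-majority meta-model
    estimators=[("tree", DecisionTreeClassifier()), ("logreg", LogisticRegression(max_iter=1000))],
    voting="hard",
)

for name, model in [("boosting", boosting), ("bagging", bagging), ("voting", voting)]:
    print(name, model.fit(X, y).score(X, y))
```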

  • Kernel (aka filter), as any required pre- and post-calculation while training and executing. This is also called the kernel trick, but it is not a trick, only stupid math. For example, transform the input via a sine operation or change the dimensionality, e.g. from 2 dimensions to 3 dimensions.

Also, check for the terms 

  • bias-variance tradeoff and
  • variance-noise tradeoff.

Approaches versus phase:

| Phase    | 1. Data preparation | 2. Model training                  | 3. Model execution |
|----------|---------------------|------------------------------------|--------------------|
| Approach | Tuning              | Regularization, Ensembling, Kernel | Ensembling, Kernel |

Bias-variance tradeoff

Learning always has bias as a foundation, because the generalization reflects beliefs. The bias-variance trade-off explains this situation well. 

In a nutshell:

  • Memorize = Overfitting: the model fully memorizes the training data and is not able to produce a result for other inputs. These models have a high variance (flexibility). 
  • Biased = Underfitting: the model is not able to capture the given data. Here the model is too simple or improper for the underlying “new” question. These models have a high bias (strictness).

Due to that, data scientists always have to find a compromise. On the one hand, feature selection helps to decrease model variance; on the other hand, larger training sets and more considered features (e.g. big data approaches) tend to decrease bias. 

Variance-noise trade-off

Noisy data points can be categorized into four groups; they have to be distinguished from the required variance (degrees of freedom). 

| Category   | Normal data                    | Noisy data                                |
|------------|--------------------------------|-------------------------------------------|
| Density    | close to the dense area/center | far away from the dense area/center       |
| Cluster    | similar features               | dissimilar features                       |
| Point      |                                | a single data point is far from the rest  |
| Contextual |                                | specific events, e.g. always in December  |

Statistics (Exploring data)

Probability and statistics are about exploring observations:

  • Probability explores one observation in comparison to all possible observations, e.g. 1 out of 6 for throwing a die, and
    • Probability distribution as a Cartesian coordinate system with 
      • x = possible value instance and 
      • y = the probability (how often the value instance is possible).
    • Compare the next chapter. 
  • Statistics explores multiple observations of features, either of
  • one feature (1-dimensional):
    • Mean (aka expectation, level) as the expected value. For a symmetric distribution it coincides with the 50% percentile (the median); a percentile is a 1%-wide slice of all available data (points). 
    • Variance (σ²) as a measure of the distance (of each value in the set) from the mean. Since variance measures the variability (volatility) around the mean, it is also a measure of risk.
    • Standard deviation (σ) is the square root of the variance. It describes the distribution around the mean (not to be confused with the standard error, which is the standard deviation of a sample mean). Standard deviations are: 
      • 1 sigma (σ): 68% of all data reside within one standard deviation,
      • 2 sigma (2σ): 95% of all data, 
      • 3 sigma (3σ): 99.7% of all data,
      • etc. 

The space covered by all sigmas is called the confidence interval.

  • more than one feature (multidimensional):
    • Covariance as the variance across multiple dimensions – aka the relationship between at least 2 variables. Covariance describes the correlation: if A changes, does B also change?  

Covariance: cov(x, y) = Σ (xᵢ − x̄)(yᵢ − ȳ) / m, with m = n − 1 if the data points are a subset (sample) of all possible points, or m = n if all possible data points are covered.
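
A small sketch of this covariance formula, computed by hand and via numpy; the two feature vectors are illustrative assumptions.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])

m = len(x) - 1                                   # sample: m = n - 1
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / m
cov_numpy = np.cov(x, y)[0, 1]                   # off-diagonal entry of the covariance matrix

print(cov_manual, cov_numpy)                     # both ≈ 3.33
```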

Also, consider the field of regression analysis and data descriptions via equations.

Based on the statistical definitions made before, the following terms build upon each other stepwise:

  • Correlation (r) as the covariance normalized by the standard deviations – a metric for linear dependency: r = cov(x, y) / (σₓ * σᵧ).
  • Significance (%) or effect size as comparing the means of multiple observations. Formula: (mean of observation 1 − mean of observation 2) / mean of observation 1 * 100.
  • Confidence interval as the mean of the significance over multiple steps. If it does not change over multiple steps, normally distributed data/observations are given. 
  • Hypothesis testing as comparing the significance of multiple observations, e.g. A/B testing for 2 observations. Everything starts with a so-called null hypothesis H₀ about the populations, an alternative hypothesis H₁ that negates the null hypothesis, and a test assessing the hypotheses while assuming a given distribution. Most popular is the Z-test. The Z-test is based on the Central Limit Theorem (CLT), which states that the sample mean follows a normal distribution, implying that the mean values describe the overall observation best. The Z value represents the number of standard deviations the observed difference of means is away from 0. The higher this number, the lower the likelihood of H₀.

Each test has pre-calculated probability values (aka p-values), mapping the test result to a probability. References can be found e.g. in one of the following links.
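
A hedged sketch of a two-sample Z-test along the lines described above; the two observation sets are made-up numbers, and scipy is assumed to be available.

```python
import numpy as np
from scipy.stats import norm

a = np.array([5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.1, 5.0])
b = np.array([5.6, 5.4, 5.7, 5.5, 5.8, 5.3, 5.6, 5.5])

# Z = difference of means measured in standard errors
z = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
p_value = 2 * norm.sf(abs(z))                    # two-sided p-value from the normal distribution

print(f"z = {z:.2f}, p = {p_value:.4f}")         # small p -> reject the null hypothesis H0
```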

Probability (exploring data)

There are two different interpretations of probability:

  1. The common one is called the frequentist interpretation. Here the probability of one specific event in a series of events is quantified, e.g. the number “1” on a die has the probability ⅙ – in a probability distribution diagram, e.g. the bell curve of a normal distribution, it describes the covered area (⅙ of the full area). 

The probability distribution for throwing a die 2 times and counting the sum (x-axis is the sum, y-axis is the probability, e.g. for the sum 2 the probability is 1/36).

  2. The other mathematical framework quantifies uncertainty; it is called Bayesian inference – something overlaps or impacts something else. The starting belief is that each event is equally likely. Due to that, the inference is conditioned with respect to the variants: 
  • 50% probability (aka belief) that the event happens and 
  • 50% probability (aka belief) that it does not happen. 

These conditions are broken down – in a tree manner – into stages (tree levels). The conditional probabilities in both directions are related to each other. This is what the Bayes theorem expresses mathematically:

P(A | B) = P(B | A) * P(A) / P(B)

Note: It is no magic, just put in the numbers and calculate the formula. 

  • P(A|B) is the likelihood that event A appears under state B, or the probability of A when B is given. This expression is also called the posterior (distribution).
  • P(B|A) is the likelihood that event B happens under a given state (aka hypothesis) A.
  • P(A) is the prior (distribution) – the probability of the hypothesis that event A appears.
  • P(B) is the total or marginal likelihood that event B happens.  

In other words: 

  • B is the Event 
  • A is the Hypotheses 

Due to that, it is used to calculate the probability of a hypothesis (A) being true, given that a certain event (B) has happened.  
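
A small numeric sketch of the rule P(A|B) = P(B|A) * P(A) / P(B); the prior and likelihood values are made-up illustrative numbers.

```python
p_a = 0.3                    # prior: probability of hypothesis A
p_b_given_a = 0.8            # likelihood: probability of event B if A is true
p_b_given_not_a = 0.2        # probability of event B if A is false

# marginal (total) probability of the event B
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = p_b_given_a * p_a / p_b            # posterior
print(f"P(A|B) = {p_a_given_b:.3f}")             # ≈ 0.632
```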

Common terms are:

  • Event as the occurrence of a variable/feature value. 
  • New state or Evidence. 
  • (Conditional) probability as a measure of how likely an event occurs. 
  • Observation or data point as the act of catching at least one event. 
  • Likelihood function estimates the probability that the observed value matches the expected value (for discrete variables); for continuous variables/features it asks how likely the expected value is. The negative (log-)likelihood function is called the error function.
  • Entropy as the measure of uncertainty.

Interpreting data

Besides the formal probabilistic and statistical approach of data exploration, a more practical and generalized approach for data exploration and feature engineering exists:

  • Data exploration is about identifying known patterns in the given data. It requires a deep business understanding. 
  • Feature engineering discovers the information content by systematically gaining a data understanding:  
    • Remove noise such as duplicates, error values, mislabeled ones, contradictions, etc.
    • Fill missing values, e.g. by deleting the surrounding ones, inter- or extrapolating with mean or regression values, KNN interpolation, etc.
    • Handle outliers, e.g. by removing them, cutting them off, etc. For numeric values, the rule of thumb is to keep only those data points which lie within the range covering 68% of all data (points/values) – one standard deviation. Compare previous chapters. 
    • Transform data into a mathematically usable representation, e.g. by squaring to remove negative values, (re-)scaling, etc.
    • Feature selection as the process of selecting the features which possibly have the most information content for the model. 

Also, compare dimensional reduction. 

Feature selection is most of the time equivalent to optimization. The easiest criterion is the correlation between 

  • given and expected, 
  • input and output, 
  • etc. 

Due to that, compare the previous statistical chapter; a small sketch of these steps follows below. 
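
A hedged sketch of the cleaning and correlation steps above using pandas (assumed to be available): fill missing values with the mean, cut off points outside one standard deviation, and rank features by their correlation with the target. The data is synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(10, 2, 200),
    "feature_b": rng.normal(0, 1, 200),
})
df.loc[5, "feature_a"] = np.nan                          # a missing value
df["target"] = 3 * df["feature_a"] + rng.normal(0, 1, 200)

# Fill missing values with the mean
df["feature_a"] = df["feature_a"].fillna(df["feature_a"].mean())

# Keep only points within one standard deviation of the mean (≈ 68 % of the data)
mask = (df["feature_a"] - df["feature_a"].mean()).abs() <= df["feature_a"].std()
df = df[mask]

# Correlation with the target as the simplest feature-selection criterion
print(df.corr()["target"].sort_values(ascending=False))
```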

  • Feature normalization aka scaling is often required because of
    • the given data types – see one of the previous chapters – or
    • changing the vector space, e.g. exponential smoothing. 
  • Feature translation aka encoding. One of the most popular use cases is text processing. 

Example:

  • Sentence 1: I eat fish
  • Sentence 2: I eat beef

The derived vector space with one-hot encoding (1 if the word is in the sentence, 0 if it is not). The meta vector space is (I, eat, fish, beef). 

  • Sentence 1: (1,1,1,0)
  • Sentence 2: (1,1,0,1)
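
A minimal sketch of this one-hot sentence encoding; the vocabulary-building step is an illustrative assumption.

```python
sentences = ["I eat fish", "I eat beef"]

# Build the meta vector space (vocabulary) in order of first appearance
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)
print(vocabulary)                          # ['I', 'eat', 'fish', 'beef']

# 1 if the word occurs in the sentence, 0 otherwise
for sentence in sentences:
    words = sentence.split()
    vector = [1 if word in words else 0 for word in vocabulary]
    print(sentence, "->", vector)          # (1,1,1,0) and (1,1,0,1)
```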

Important for encoding is that the information content (semantics) remains. Due to that, a huge variety of different encoding approaches has been developed.  

The “Cross-industry standard process for data mining” (CRISP-DM) illustrates the relationship between 

  • business- and 
  • data understanding. 

A data scientist has to combine both capabilities. 

Feature translation

Feature translation, also called encoding or vectorization, ensures a numerically/mathematically processable representation of the given data (features). Here, descriptions of the data – compare the introduction – via vectors and matrices are often applied. 

Two commonly used approaches are:

  • Label encoding is good to keep the processing time “optimal”. It transfers a given value into a plain numeric representation, e.g. man to 0 and woman to 1. 
  • One-hot encoding helps to keep features linearly independent. Never map categorical variables directly to numbers unless they are binary (e.g. man vs. woman), because it implies a relationship: ship = 0, car = 1, airplane = 2 implies that a car is more closely related to a ship than an airplane is. Per default, use one-hot encoding. One-hot encoding creates a matrix of all variants: 
| man | woman | Encoding   |
|-----|-------|------------|
| 0   | 1     | 01 = woman |
| 1   | 0     | 10 = man   |
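
A hedged sketch of label encoding vs. one-hot encoding with pandas (assumed to be available); the vehicle column is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"vehicle": ["ship", "car", "airplane", "car"]})

# Label encoding: a plain numeric code per category (implies an unwanted order)
df["vehicle_label"] = df["vehicle"].astype("category").cat.codes

# One-hot encoding: one linearly independent column per category
one_hot = pd.get_dummies(df["vehicle"], prefix="vehicle")

print(df)
print(one_hot)
```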

Sub-clauses: 

  • Once you have understood the vectorization approach, most machine learning approaches are no big deal. 
  • Vectors and matrices are only one way to describe functions – compare the introduction. 

Dependencies (Describing data)

One fundamental rule of describing data is the dependency between features (aka dimensions) – also compare covariance. For an equation f(x) = y, the 

  • dependent part is y, because it depends on the calculation and the input x, and the
  • variable/dimension/feature x is the independent part. 

Mapped to a Cartesian coordinate system, the x-axis carries the independent and the y-axis the dependent variable. 

Alternative terms are 

  • for the dependent variable: effect, impact, action, response, target (function), outcome, ordinate, etc., and
  • for the independent variable: root cause, stimulus, treatment, regressor, predictor, abscissa, etc.

Regressions

A regression model describes the dependency between at least two variables/dimensions/features – if B changes its value, how does A change? 

These dependencies – compare statistics – can also be formulated via equations/rules. Equations can be constructed from components (e.g. straight lines). Each component is like a vector added to the others. If a component is a linear regression, the component type is called a “linear component”.

Linear components

Example of one linear regression component: 

y = m * x + b

The variable m is also called slope, weight or gradient = Δy / Δx.
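
A small sketch of fitting one such linear component y = m*x + b to noisy data with numpy; the generated data is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(0, 1, size=x.size)   # true slope 2.5, intercept 1.0

m, b = np.polyfit(x, y, deg=1)                      # least-squares fit of a degree-1 polynomial
print(f"slope m ≈ {m:.2f}, intercept b ≈ {b:.2f}")
```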

The set of all combined linear components (e.g. 3 times “a” in the following example) across multiple primary components (aka dimensions) is called an equation system.

Example: 

c = 3a + 2b, 3 times “a” and 2 times “b”

Vectors

Each of the components a, b and c can be seen as vectors, but three ways of thinking about a vector exist: 

  1. Physics think about a line, having a start point and pointing freely in space. The line has a length and a direction. 
  2. The computer scientists reshape it as an ordered list of elements, e.g. (A, B, C,…).
  3. Mathematicians generalizes the view on vector and adds operations to the definition. Last but not least and as a simplification to the physical view it has a fixed origin e.g. (0,0) in 2D space. This is also called root(ed). 

Forms of representing linear components:

  • Equation system representation

2*x1 + 1*x2 − 1*x3 = −1

0*x1 − 1*x2 + 1*x3 = −2

  • Functional representation

f(x1, x2, x3) = (2*x1 + 1*x2 − 1*x3) + (0*x1 − 1*x2 + 1*x3)

  • Geometric representation

Note: A matrix is a combination of vectors – in the given example, 2 (row) vectors. 
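
A sketch of that equation system written as a matrix (a combination of row vectors) and solved in the least-squares sense with numpy, since it has more unknowns than equations.

```python
import numpy as np

A = np.array([[2.0, 1.0, -1.0],     # 2*x1 + 1*x2 - 1*x3 = -1
              [0.0, -1.0, 1.0]])    # 0*x1 - 1*x2 + 1*x3 = -2
b = np.array([-1.0, -2.0])

x, residuals, rank, _ = np.linalg.lstsq(A, b, rcond=None)
print("one solution:", x)
print("rank of the matrix:", rank)
```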

Vector Algebra

Also compare the following blog post https://www.holeoftherabbit.com/2022/06/14/linear-algebra-in-a-nutshell/

Let’s proceed with the mathematical way of thinking and linear algebra (performing operations/transformations on linear components). The span is the set of all vectors reachable by such transformations (linear combinations). 

Possible transformations are:

  • Addition: First, vectors/matrices can be added and subtracted, which changes length and direction. This is also called moving (translation). 
  • Multiplication: Second, a vector/matrix can be multiplied by a scalar (aka number), which increases or reduces the length. This is also called scaling. 

The determinant is the factor by which the area increases or decreases when a transformation matrix is used instead of a scalar, e.g.

in the example the stretch is 6. The eigenvalues determine the stretch factor of a transformation matrix along the corresponding eigenvectors. If an eigenvalue is 1, no stretch exists along that direction. 

If the determinant is 0, the rank (/dimensionality) can be reduced without losing information content – compare the previous chapter.
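
A small sketch of determinant and eigenvalues with numpy; the matrix is an illustrative assumption chosen so its determinant is 6, matching the stretch example above.

```python
import numpy as np

T = np.array([[3.0, 0.0],
              [0.0, 2.0]])            # a transformation matrix

print("determinant (area scaling):", np.linalg.det(T))                  # 3 * 2 = 6
print("eigenvalues (stretch per eigenvector):", np.linalg.eigvals(T))   # [3, 2]
```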

Multiple chained matrix transformations correspond to 

  • Combination: Performs at least two functions side by side, e.g. with x = 3, f(x) = x² and g(x) = x − 2: g(x) + f(x) = 1 + 9 = 10.
  • Composition: Performs at least two functions nested, e.g. with x = 3, f(x) = x² and g(x) = x − 2: f ∘ g results in f(g(3)) = f(1) = 1² = 1.

One important part of linear algebra is the equality of data. Let's start with a vector pointing to a specific position in the respective space. Two vectors represent two points in space, and their distance/difference can be measured, e.g. via the Euclidean distance metric. Equality means that the distance is (close to) 0.

Dimensional reduction

The equality of data (points, observations, etc.) is the fundamental question behind dimensional reduction. Dimensions (aka features, variables, components) can be dropped if only little information content is lost. The result doesn't need to be exactly the same, but nearly. 

One example of dimensional reduction is a Cartesian coordinate system (2 dimensions), in which each dimension is a primary component, being transformed into a bar chart (1 dimension).  

Common approaches are

  • Principal Component Analysis (PCA) answers the question of which directions (principal components) carry the maximum variance around the mean. Dimensions with little variance can be dropped.  
  • In comparison, Linear Discriminant Analysis (LDA) is like PCA, but focuses on the maximum separation between known classes. This is close in spirit to clustering; the K-Means algorithm is slightly comparable. 

For text processing, it is used to determine topics of text documents. This is often called topic modelling.

  • The most mentioned approach is the Support Vector Machine (SVM). The idea is to find a (support) vector function allowing to build clusters – to do the necessary separation of data points. The term machine refers to the set of predefined functions, like f(x,y) = x*y and f(x,y) = x+y; the decision boundary is where the chosen function evaluates to 0.  
  • Any vector that is a multiple (e.g. by 2) of another one can be dropped, e.g. (2,1,1) can be dropped if (4,2,2) is also in the vector space. 
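
A hedged sketch of dimensional reduction with PCA using scikit-learn (assumed to be available); the nearly redundant 2-dimensional data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
x = rng.normal(0, 5, 200)                       # direction with large variance
y = 0.5 * x + rng.normal(0, 0.5, 200)           # almost redundant second dimension
data = np.column_stack([x, y])

pca = PCA(n_components=2).fit(data)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Keep only the first principal component (the direction with the maximum variance)
reduced = PCA(n_components=1).fit_transform(data)
print("reduced shape:", reduced.shape)          # (200, 1)
```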

Classification & clustering

With classification, multiple classes are pre-defined (supervised). The model has to assign input data to one of these classes. The assignment is made via distance estimation, e.g. Euclidean distance, between the predefined classes and the input. 

In comparison, clustering has no predefined classes. Clusters are derived from the distances between all available data points. Clustering is often also called filtering – also compare the previous optimization chapter. One common use case is collaborative filtering, as used for (product) recommendation engines. The considered dimensions are mapped to a vector space (e.g. a Cartesian coordinate system) – also called encoded. Also, compare the previous encoding chapter. 
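
A hedged sketch contrasting the two: classification uses given labels, clustering derives groups purely from distances. It uses scikit-learn (assumed to be available) on synthetic data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Classification: classes are predefined via the labels y
classifier = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("predicted class:", classifier.predict(X[:1]))

# Clustering: clusters are derived purely from the distances between data points
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("cluster of the first point:", clusters[0])
```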

Neural Networks 

The human ability to imagine models is limited by the feasible amount of dimensionality of input and output features. The complexity comes from determining a functional calculus of nested linear component concatenations, but the concept itself is straightforward.  

Example:

Two linear components (aka principal components) for f(x,y) could be 

  1. 0.5*x + 0*y – 0.25
  2. 0*x + 0.5*y – 0.25

The concatenation (without kernels such as sine) could e.g. look like −0.033 * (0.5*x + 0*y − 0.25) + −0.19 * (0*x + 0.5*y − 0.25).  

The task for neural network learning is to determine the optimal weights to classify data points (blue vs. white area). In this example the learned weights are the outer factors −0.033 and −0.19.  

The bias term is −0.25. 

The equation could be represented in a computation-graph f(x,y)=z

Basic terms are as follows:

  • Each feature is represented via a node.
  • Each arrow represents an activation.
  • Except for the input and output layer, each layer is called a hidden layer.
  • A node with input arrows is called a perceptron (linear combination plus activation function) if, in addition, a normalization, e.g. the weighted sum of all inputs, takes place. 
  • In addition to the perceptron, a transfer function ensures the output value range, e.g. from 0 to 1.  

Classical transfer functions are:

  • Sigmoid: σ(x) = eˣ / (1 + eˣ)
  • tanh: tanh(x) = 2 * σ(2x) − 1
  • ReLU: max(0, x)
  • Each layer of weights (not nodes) is called a synapse.
  • The input nodes are called the retina.
  • If more than 1 hidden layer exists, the network is called a deep neural network.
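
A minimal forward-pass sketch tying the pieces together: the two linear components from the example above, their weighted concatenation and the classical transfer functions. The numbers are the illustrative values from the text.

```python
import numpy as np

def sigmoid(x):
    return np.exp(x) / (1 + np.exp(x))          # equivalently 1 / (1 + exp(-x))

def tanh(x):
    return 2 * sigmoid(2 * x) - 1

def relu(x):
    return np.maximum(0, x)

def forward(x, y):
    h1 = 0.5 * x + 0.0 * y - 0.25               # first linear component (hidden node 1)
    h2 = 0.0 * x + 0.5 * y - 0.25               # second linear component (hidden node 2)
    z = -0.033 * h1 + -0.19 * h2                # weighted combination (output node)
    return sigmoid(z)                           # transfer function keeps the output in (0, 1)

print(forward(1.0, 1.0))
```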

Neural network setups

The following types of neural networks are commonly used: 

  • Autoencoder (aka identity function) performing transformations without losing sparsity (lack or deficiency of information). Treat it as a learnable filter, e.g. drop the background of an object in a picture, or learn the typical data distribution, which implies that noise is dropped. It consists of two chained neural networks:
    • Encoder, reducing from high dimensionality to lower dimensionality, and
    • Decoder, extrapolating from lower dimensionality back to higher dimensionality. 
  • Recurrent Neural Network (RNN) learning sequences, e.g. the Fibonacci sequence (1, 1, 2, 3, 5, 8, etc.). 

The formula is: F(n) = F(n−1) + F(n−2)

Example: 1, 1, 2, 3 (as sequence) with 2 + 3 (= 5, as state) for the next sequence number. 

The “sequence part” (h1, h2, …) as well as the state are both possible outputs. Due to that, different input-to-output combinations are possible: 

  • one input to one output (state 2 state)
  • one input to many outputs (state 2 sequences)
  • many inputs to one output (sequence 2 states)
  • many inputs to many outputs (sequence 2 sequence)
  • Convolutional Neural Network (CNN) as a classical neural network extended by upfront kernel operations – the convolution. The applied kernel focuses on summarizing (filtering down) the input, often pictures, before it gets classified via the neural network. At each step, the convolutional filter is e.g. multiplied with the covered area. The covered area is called a patch. The movement step width and height are called the stride.

CNNs support an extensive amount of use cases, not only computer vision ones. Nevertheless, they underperform on variations, e.g. rotations, that were not part of the neural network/classifier training. The Capsule Network described later closes these gaps. 

Besides the variety of alternative use cases, the normal convolution can be easily extended. All these approaches distinguish two things:

  1. Image height and width as the so-called spatial size of the feature map and
  2. the depth or channel map, e.g. the R, G, B colors. 

Under consideration of the two most common convolution operations,

  • patch size 3×3 and
  • patch size 1×1  

a major difference in the connectedness, and therefore in the number of required operations between input and output, exists. The connectedness is distinguished between locally connected and fully connected. 

  • A patch size of 3×3 results in a locally connected spatial domain and a fully connected channel domain. 
  • In comparison, the 1×1 patch size, also called pointwise or blending convolution, changes the number of channels. The computational cost is reduced by a factor of 9 between a 3×3 and a 1×1 convolution. 
  • In comparison to CNNs, the Capsule neural network introduces a so-called nested set of layers. The objective is to learn relationships between information segments, e.g. a face always has two eyes directly beside each other and the mouth below them. 
  • The Hierarchical Temporal Memory (HTM) neural networks, or neocortex networks, have similarities to convolutional neural networks (CNN). A visual-cortex-like layer, also called SDR or sparse layer, materializes the input. In difference to CNNs, the pooling kernel is applied between the layers (often over an ongoing stream) and not within one layer itself. 

The SDR layers are divided into segments. Each segment recurringly receives the same information source. A bunch of SDR layers is summarized by a pooling operation. The pooling result is fed into the classifying neural network.  

  • Generative Adversarial Network (GAN) consists of two parallel networks:
    • Generator, faking samples while training, and
    • Discriminator, which on the other hand tries to distinguish between the real and the fake samples while training.  

During execution the generator can generate new data samples and the discriminator can classify them (and others).

Autoencoders and GANs are both generative models, learning the data distribution rather than its density like a CNN does. Generally, GANs try to create representations that are sufficient to generalize the true data distribution, conditioned upon a discriminator.

  • FaceNet introduces a specific CNN to do the required input data encoding. The CNN applies a trained loss function, allowing it to “define its own distance function”. By doing so, the output vector can be used to calculate the similarity between two vectors, e.g. via the Euclidean distance.

Appendix A: Markov (Chain) Approximation

According to Wikipedia, it is “a model where the next state is solely chosen based on the previous state”. Consider a graph of states. The Markov model estimates the probability of moving from one state to another. 

Example: Consider the following transition graph:

A: B

B: C

C: ( D, E, F,G )

The (combined) probability to get from A to G is 1/1 * 1/1 * 1/4 = 1/4 = 25%.
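
A small sketch of this chained transition probability; the dictionaries mirror the transition graph from the example.

```python
transitions = {
    "A": {"B": 1.0},
    "B": {"C": 1.0},
    "C": {"D": 0.25, "E": 0.25, "F": 0.25, "G": 0.25},
}

def path_probability(path):
    # Multiply the probability of each step in the path.
    probability = 1.0
    for current, nxt in zip(path, path[1:]):
        probability *= transitions[current].get(nxt, 0.0)
    return probability

print(path_probability(["A", "B", "C", "G"]))   # 1.0 * 1.0 * 0.25 = 0.25
```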

Appendix B. Causality

One of the most misunderstood sets of terms is:

  • Correlation describes a relationship between two elements. An element could be an event, a state, a property, etc. Correlation does not imply causality.
  • Causation is cause and effect; it does not need to imply coincidence.
  • Coincidence is the observation that two events appear in parallel. Coincidence also does not imply causality. If the number of equal observations increases, the probability increases; nevertheless, it is not necessarily a causal relationship. 

Great examples can be found on the following site http://www.tylervigen.com/spurious-correlations

One example:

Source:  http://www.tylervigen.com/spurious-correlations 

The correlation coefficient is 99.7%, but still no causality. 

In addition to that, machine learning has an objective that differs from the objective of statistics. Statistics is about understanding the impact of independent variables on dependent ones – the correlation; we try to describe the relationship as flexibly as possible. Machine learning is about generalization – generalizing the mapping (a.k.a. function) between inputs and outputs, which is not of a flexible character. The following video gives you an idea of causality: 

Take away: What happens when the underlying mechanism changes? Commonly, kernel-based machine learning approaches then fail. 

Appendix C: Time series

Time series data is the result of a stochastic process, e.g. a Markov chain. Time series are indexed by an integer and observed in this order – the order of time. Simplified, a time series is one possible realization of a stochastic process. The data can be observed in two different manners: 

  • Contiguous: The observations are made with the same time difference, e.g. every hour.
  • Discontiguous: The observations are made with differing time differences.

In both cases, one variable (univariate) or multiple variables (multivariate) are observed. Due to that, some common behavior can be described: 

  • Trend explains whether the observed value increases or decreases in relation to time. 
  • Seasonality describes if and when a recurring pattern in the observed values exists.

Determining these components is called seasonal-trend decomposition (e.g. STL, Seasonal and Trend decomposition using Loess). Commonly two approaches exist: 

  • Additive = Level + Trend + Seasonality + Noise 
  • Multiplicative =  Level * Trend * Seasonality * Noise. 
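
A hedged sketch of an additive decomposition with statsmodels (assumed to be available); the synthetic monthly series with trend, yearly seasonality and noise is an illustrative assumption.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(3)
months = pd.date_range("2020-01-01", periods=48, freq="MS")
trend = np.linspace(10, 20, 48)                           # slowly increasing level
seasonality = 3 * np.sin(2 * np.pi * np.arange(48) / 12)  # yearly pattern
noise = rng.normal(0, 0.5, 48)
series = pd.Series(trend + seasonality + noise, index=months)

result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head())
```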

Also, the execution phase has some specific wording:

  • Single-step vs. multi-step: Forecast one or multiple values ahead.
  • LAG: The lead time as the time span between the last given data and the look-ahead, e.g. LAG3 with a monthly basis means 3 months ahead. 
  • Stationary: A stochastic process whose (probability) distribution does not change when shifted in time; or, simplified, mean and variance do not change over time (spans). 
  • Frequency: The width of the recurring cycle. 
  • Amplitude: The height of the curve.