When studying correlations, one tries to determine whether there is a relationship between two indicators in the same sample (for example, between the height and weight of children, or between IQ and school performance) or between two different samples (for example, when comparing pairs of twins) and, if this relationship exists, whether an increase in one indicator is accompanied by an increase (positive correlation) or a decrease (negative correlation) in the other.

In other words, correlation analysis helps to establish whether the values of one indicator can be predicted from the values of another.

Until now, when analyzing the results of our experiment on the effects of marijuana, we have deliberately ignored such an indicator as reaction time. Meanwhile, it would be interesting to check whether there is a relationship between the effectiveness of reactions and their speed. This would allow one to assert, for example, that the slower a person is, the more accurate and efficient his actions will be, and vice versa.

For this purpose, two different methods can be used: the parametric method of calculating the Bravais–Pearson coefficient (r), and calculation of the Spearman rank correlation coefficient (r_s), which applies to ordinal data and is therefore nonparametric. First, however, let us understand what a correlation coefficient is.

Correlation coefficient

The correlation coefficient is a value that can vary from −1 to +1. In the case of a complete positive correlation this coefficient equals +1, and in the case of a complete negative correlation it equals −1. On a graph, this corresponds to a straight line passing through the points formed by each pair of values:


If these points do not line up along a straight line but form a “cloud,” the absolute value of the correlation coefficient becomes less than one and approaches zero as the cloud becomes more rounded:

If the correlation coefficient is 0, there is no linear relationship between the two variables.

In the humanities, a correlation is considered strong if its coefficient exceeds 0.60; if it exceeds 0.90, the correlation is considered very strong. However, in order to draw conclusions about relationships between variables, the sample size matters greatly: the larger the sample, the more reliable the obtained correlation coefficient. There are tables of critical values of the Bravais–Pearson and Spearman correlation coefficients for different numbers of degrees of freedom (equal to the number of pairs minus 2, i.e., n − 2). A correlation coefficient can be considered reliable only if it exceeds the corresponding critical value. Thus, for a correlation coefficient of 0.70 to be reliable, at least 8 pairs of data must enter the analysis (df = n − 2 = 6) when calculating r (Table B.4), and 7 pairs of data (df = n − 2 = 5) when calculating r_s (Table 5 in Appendix B.5).
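One way to see where such critical values of r come from is to convert the critical value of Student's t (for df = n − 2) into a critical r via r_crit = t_crit / √(t_crit² + df). A minimal Python sketch, with the two-tailed t values taken from a standard t table (for Spearman's r_s at small n the exact tables differ slightly):

```python
from math import sqrt

def r_critical(t_critical: float, df: int) -> float:
    """Critical Pearson r derived from a critical Student's t,
    for df = n - 2 degrees of freedom."""
    return t_critical / sqrt(t_critical ** 2 + df)

# Two-tailed critical t values from a standard Student's t table
print(round(r_critical(2.447, 6), 2))   # alpha = 0.05, df = 6  -> 0.71
print(round(r_critical(2.807, 23), 2))  # alpha = 0.01, df = 23 -> 0.51
```

The first value matches the rule of thumb above: with 8 pairs (df = 6), r must exceed roughly 0.70 to be reliable at the 0.05 level.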

Bravais–Pearson coefficient

To calculate this coefficient, the following formula is used (it may look different in different authors):

r = (ΣXY − n·X̄·Ȳ) / ((n − 1)·s_X·s_Y),

where ΣXY is the sum of the products of the data in each pair;

n is the number of pairs;

X̄ is the mean for variable X;

Ȳ is the mean for variable Y;

s_X is the standard deviation for the distribution of X;

s_Y is the standard deviation for the distribution of Y.
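As an illustrative sketch (with made-up numbers, not the experiment's data), the formula r = (ΣXY − n·X̄·Ȳ) / ((n − 1)·s_X·s_Y) can be implemented directly in Python:

```python
from math import sqrt

def pearson_r(x, y):
    """Bravais-Pearson r via the textbook formula
    r = (sum(XY) - n*mean(X)*mean(Y)) / ((n - 1)*s_X*s_Y),
    where s_X and s_Y are the sample standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_x = sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    sum_xy = sum(a * b for a, b in zip(x, y))
    return (sum_xy - n * mean_x * mean_y) / ((n - 1) * s_x * s_y)

# A perfectly linear increasing pair of variables gives r close to 1,
# a decreasing pair gives r close to -1 (up to floating-point rounding)
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))
```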

We can now use this coefficient to determine whether there is a relationship between the subjects' reaction time and the effectiveness of their actions. Take, for example, the background level of the control group.

n·X̄·Ȳ = 15 · 15.8 · 13.4 = 3175.8;

(n − 1)·s_X·s_Y = 14 · 3.07 · 2.29 = 98.42;

r = …

A negative correlation coefficient may mean that the longer the reaction time, the lower the performance. However, its value is too small to allow us to talk about a reliable relationship between these two variables.

ΣXY = ………; n·X̄·Ȳ = ………;

(n − 1)·s_X·s_Y = ………;

r = ………

What conclusion can be drawn from these results? If you think there is a relationship between the variables, is it direct or inverse? Is it reliable [see Table 4 (Appendix B.5) with the critical values of r]?

Spearman's rank correlation coefficient r_s

This coefficient is easier to calculate, but the results are less accurate than when using r. This is due to the fact that when calculating the Spearman coefficient, the order of the data is used, and not their quantitative characteristics and intervals between classes.

The point is that the Spearman rank correlation coefficient (r_s) checks only whether the ranking of the data in one sample is the same as in another series of data for that sample, pairwise related to the first (for example, will students be “ranked” in the same way when they take both psychology and mathematics, or even with two different psychology teachers?). If the coefficient is close to +1, both series are practically identical; if it is close to −1, we can speak of a complete inverse relationship.

The coefficient r_s is calculated by the formula

r_s = 1 − (6·Σd²) / (n·(n² − 1)),

where d is the difference between the ranks of the paired values (regardless of its sign), and n is the number of pairs.
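Under the conventions used here (tied values receive their mean rank; no tie correction in the formula itself), the calculation can be sketched in Python:

```python
def ranks(values):
    """Assign ranks 1..n; tied (ex aequo) values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the group while successive sorted values are equal
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman_rs(x, y):
    """r_s = 1 - 6*sum(d^2) / (n*(n^2 - 1)), without tie correction."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

print(spearman_rs([10, 20, 30, 40, 50], [1, 2, 3, 4, 5]))  # -> 1.0
print(spearman_rs([10, 20, 30, 40, 50], [5, 4, 3, 2, 1]))  # -> -1.0
```

Identical rankings give r_s = 1, perfectly reversed rankings give r_s = −1, matching the interpretation above.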

Typically, this nonparametric test is used when conclusions need to be drawn not so much about the intervals between the data as about their ranks, and also when the distribution curves are too asymmetrical to allow the use of parametric criteria such as the coefficient r (in these cases it may be necessary to convert quantitative data into ordinal data).

Since this is the case with the distributions of efficiency and reaction-time values in the experimental group after exposure, you can repeat the calculations you have already done for this group, only now not for the coefficient r but for the indicator r_s. This will show how different the two indicators are.*

*It should be remembered that

1) for the number of hits, rank 1 corresponds to the highest performance and rank 15 to the lowest, while for reaction time rank 1 corresponds to the shortest time and rank 15 to the longest;

2) tied (ex aequo) values are given the mean rank.

Thus, as in the case of the coefficient r, a positive, though unreliable, result was obtained. Which of the two results is more plausible: r = −0.48 or r_s = +0.24? Such a question could only arise if the results were reliable.

I would like to emphasize once again that the essence of these two coefficients is somewhat different. A negative coefficient r indicates that efficiency is mostly higher the shorter the reaction time, whereas in calculating the coefficient r_s we checked whether faster subjects always react more accurately and slower ones less accurately.

Since the coefficient r_s obtained for the experimental group after exposure equals 0.24, such a trend is evidently not observed here. Try to analyze the data for the control group after the intervention on your own, knowing that Σd² = 122.5:

r_s = ……… Is it reliable?

What is your conclusion? …………………………………………………………………………………………

So, we have examined various parametric and nonparametric statistical methods used in psychology. Our review was very superficial, and its main purpose was to show the reader that statistics is not as frightening as it seems and mostly requires common sense. We remind you that the “experiment” data we dealt with here are fictitious and cannot serve as a basis for any conclusions. However, such an experiment would really be worth conducting. Since a purely classical technique was chosen for it, the same statistical analysis could be used in many different experiments. In any case, we hope to have outlined some main directions that may be useful to those who do not know where to begin a statistical analysis of their results.

There are three main branches of statistics: descriptive statistics, inductive statistics and correlation analysis.

In scientific research there is often a need to find a relationship between outcome and factor variables (the yield of a crop and the amount of precipitation, the height and weight of a person in groups homogeneous in sex and age, heart rate and body temperature, etc.).

The first are the signs that change under the influence of those associated with them; the second are the signs that contribute to changes in the first.

The concept of correlation analysis

Based on the above, we can say that correlation analysis is a method used to test hypotheses about statistically significant relationships between two or more variables when the researcher can measure them but not change them.

There are other definitions of the concept in question. Correlation analysis is a processing method that involves studying correlation coefficients between variables, in which the coefficients for one pair or many pairs of characteristics are compared in order to establish statistical relationships between them. Correlation analysis is also defined as a method for studying statistical dependence between random variables, not necessarily of a strictly functional nature, in which changes in one random variable lead to changes in the mathematical expectation of another.

The concept of spurious correlation

When conducting correlation analysis, it must be kept in mind that it can be carried out on any set of characteristics, however absurd in relation to each other. Sometimes they have no causal connection at all.

In this case, one speaks of a spurious (false) correlation.

Problems of correlation analysis

Based on the above definitions, the tasks of the described method can be formulated as follows: to obtain information about one of the sought variables by means of another, and to determine the closeness of the relationship between the studied variables.

Correlation analysis involves determining the relationship between the characteristics being studied, and therefore the tasks of correlation analysis can be supplemented with the following:

  • identification of the factors that most strongly influence the resulting characteristic;
  • identification of previously unexplored causes of relationships;
  • construction of a correlation model with its parametric analysis;
  • study of the significance of the relationship parameters and their interval estimation.

Relationship between correlation analysis and regression

The method of correlation analysis is often not limited to finding the closeness of the relationship between the studied quantities. Sometimes it is supplemented by the construction of regression equations, which are obtained by the analysis of the same name and which describe the correlation dependence between the resulting characteristic and the factor characteristic(s). Together with the analysis under consideration, this constitutes the method of correlation-regression analysis.

Conditions for using the method

Resultant indicators depend on one or several factors. The correlation analysis method can be used if there is a large number of observations of the values of the resultant and factor indicators, and the studied factors must be quantitative and reflected in specific sources. If the resultant indicators follow the normal law, the result of the correlation analysis is the Pearson correlation coefficient; if the characteristics do not obey this law, the Spearman rank correlation coefficient is used.

Rules for selecting correlation analysis factors

When using this method, it is necessary to determine the factors that influence the performance indicators. They are selected taking into account that there must be cause-and-effect relationships between the indicators. When creating a multifactor correlation model, factors with a significant impact on the resulting indicator are selected; it is preferable not to include in the model interdependent factors with a pairwise correlation coefficient above 0.85, or factors whose relationship with the resultant parameter is nonlinear or functional in character.

Displaying results

The results of correlation analysis can be presented in text and graphic forms. In the first case they are presented as a correlation coefficient, in the second - in the form of a scatter diagram.

In the absence of a correlation between the parameters, the points on the diagram are scattered chaotically. A medium degree of relationship is characterized by greater order, with the points lying at a more or less uniform distance from the median. A strong relationship tends toward a straight line, and at r = 1 the scatter plot is a straight line. An inverse correlation runs from the upper left to the lower right of the graph, a direct correlation from the lower left to the upper right corner.

3D representation of a scatter plot

In addition to the traditional 2D scatter plot display, a 3D graphical representation of correlation analysis is now used.

A scatterplot matrix is also used, which displays all pairwise plots in a single figure in matrix format. For n variables, the matrix contains n rows and n columns; the chart at the intersection of the i-th row and j-th column is a plot of variable Xi against Xj. Thus each row and column corresponds to one variable, and each cell displays a scatterplot of a pair of them.

Assessing the tightness of the connection

The closeness of the correlation is determined by the correlation coefficient (r): strong for |r| from 0.7 to 1, medium from 0.3 to 0.699, weak from 0 to 0.299. This classification is not strict; the figure shows a slightly different scheme.
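These thresholds can be expressed as a small helper function (the cut-offs come from the classification above and, as noted, are not strict):

```python
def correlation_strength(r: float) -> str:
    """Classify the closeness of a correlation by |r|,
    using the non-strict thresholds given in the text."""
    a = abs(r)
    if a >= 0.7:
        return "strong"
    if a >= 0.3:
        return "medium"
    return "weak"

print(correlation_strength(-0.82))  # -> strong
print(correlation_strength(0.45))   # -> medium
print(correlation_strength(0.12))   # -> weak
```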

An example of using the correlation analysis method

An interesting study was undertaken in the UK. It is devoted to the connection between smoking and lung cancer, and was carried out through correlation analysis. This observation is presented below.

Initial data for correlation analysis

Professional group — mortality

  • Farmers, foresters and fishermen
  • Miners and quarry workers
  • Manufacturers of gas, coke and chemicals
  • Manufacturers of glass and ceramics
  • Workers of furnaces, forges, foundries and rolling mills
  • Electrical and electronics workers
  • Engineering and related professions
  • Woodworking industries
  • Leatherworkers
  • Textile workers
  • Manufacturers of work clothes
  • Workers in the food, drink and tobacco industries
  • Paper and print manufacturers
  • Manufacturers of other products
  • Builders
  • Painters and decorators
  • Drivers of stationary engines, cranes, etc.
  • Workers not included elsewhere
  • Transport and communications workers
  • Warehouse workers, storekeepers, packers and filling machine workers
  • Office workers
  • Sellers
  • Sports and recreation workers
  • Administrators and managers
  • Professionals, technicians and artists

We begin correlation analysis. For clarity, it is better to start the solution with a graphical method, for which we will construct a scatter diagram.

It demonstrates a direct connection. However, it is difficult to draw an unambiguous conclusion based on the graphical method alone. Therefore, we will continue to perform correlation analysis. An example of calculating the correlation coefficient is presented below.

Using software (MS Excel, described below, as an example), we determine the correlation coefficient, which is 0.716, indicating a strong relationship between the parameters under study. Let us determine the statistical reliability of the obtained value using the corresponding table: from the 25 pairs of values we subtract 2, obtaining 23 degrees of freedom, and in that row of the table we find the critical r for p = 0.01 (since these are medical data, a stricter criterion is used; in other cases p = 0.05 is sufficient), which is 0.51 for this analysis. Since the calculated r exceeds the critical r, the value of the correlation coefficient is considered statistically reliable.
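The same conclusion can be reached by converting r into Student's t via t = r·√(n − 2)/√(1 − r²) and comparing it with the two-tailed critical t for 23 degrees of freedom at p = 0.01 (2.807, from a standard t table). A Python sketch:

```python
from math import sqrt

def t_from_r(r: float, n: int) -> float:
    """Student's t statistic for a Pearson correlation coefficient r
    computed from n pairs (df = n - 2)."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

t = t_from_r(0.716, 25)
print(round(t, 2), t > 2.807)  # t is about 4.9, well above the critical t
```

Because t exceeds 2.807, the correlation of 0.716 is statistically reliable at p = 0.01, in agreement with the table lookup.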

Using software when conducting correlation analysis

The described type of statistical data processing can be carried out using software, in particular MS Excel. Correlation analysis there involves calculating the following parameters using built-in functions:

1. The correlation coefficient is determined using the CORREL function (array1; array2), where array1 and array2 are the cell ranges of the resultant and factor variables.

The linear correlation coefficient is also called the Pearson correlation coefficient; accordingly, starting with Excel 2007, the PEARSON function can be used with the same arrays.

Graphical display of correlation analysis in Excel is done using the “Charts” panel with the “Scatter Plot” selection.

After specifying the initial data, we get a graph.

2. Assessing the significance of the pairwise correlation coefficient using Student's t-test. The calculated value of the t-criterion is compared with the tabulated (critical) value from the corresponding table, taking into account the specified significance level and the number of degrees of freedom. In Excel this estimate can be obtained with the TINV function (probability; degrees_of_freedom).

3. The matrix of pairwise correlation coefficients is built using the Data Analysis tool, in which Correlation is selected. The statistical assessment of pairwise correlation coefficients is carried out by comparing their absolute values with the tabulated (critical) value. When the calculated pairwise coefficient exceeds the critical one, the null hypothesis of no linear relationship is rejected at the given significance level.

In conclusion

The use of the correlation analysis method in scientific research allows us to determine the relationship between various factors and performance indicators. It is necessary to take into account that a high correlation coefficient can be obtained from an absurd pair or set of data, and therefore this type of analysis must be carried out on a sufficiently large array of data.

After obtaining the calculated value of r, it is advisable to compare it with the critical r to confirm the statistical reliability of the value. Correlation analysis can be carried out manually using the formulas, or with software, in particular MS Excel, where one can also construct a scatter diagram to visualize the relationship between the studied factors and the resulting characteristic.

The correlation coefficient is the degree of relationship between two variables. Calculating it shows whether there is a relationship between two data sets. Unlike regression, correlation does not predict the values of quantities; however, calculating the coefficient is an important preliminary step of statistical analysis. For example, having found that the correlation coefficient between the level of foreign direct investment and the GDP growth rate is high, we might conclude that to ensure prosperity it is necessary to create a favorable climate specifically for foreign entrepreneurs. Not such an obvious conclusion at first glance!

Correlation and Causality

There is perhaps no other area of statistics that has become so firmly entrenched in our lives. The correlation coefficient is used in all areas of social knowledge. Its main danger is that high values are often exploited to convince people of certain conclusions. In fact, however, a strong correlation does not at all indicate a cause-and-effect relationship between the quantities.

Correlation coefficient: Pearson and Spearman formula

There are several basic indicators characterizing the relationship between two variables. Historically, the first is Pearson's linear correlation coefficient, which is taught in school. It was developed by K. Pearson and G. Yule on the basis of the work of F. Galton. This coefficient reflects the linear relationship between quantitative variables. It always lies between −1 and 1: a negative value indicates an inversely proportional relationship, zero indicates no relationship between the variables, and a positive value a directly proportional relationship between the quantities under study. Spearman's rank correlation coefficient simplifies the calculations by ranking the values of the variables.

Relationships between variables

Correlation helps answer two questions: first, whether the relationship between the variables is positive or negative; second, how strong the dependence is. Correlation analysis is a powerful tool for obtaining this important information. It is easy to see that family income and expenses fall and rise proportionally; such a relationship is considered positive. On the contrary, when the price of a product rises, the demand for it falls; this relationship is called negative. The values of the correlation coefficient range between −1 and 1. Zero means there is no relationship between the studied values; the closer the obtained indicator is to the extremes, the stronger the relationship (negative or positive). A coefficient between −0.1 and 0.1 indicates the absence of dependence, though it must be understood that such a value indicates only the absence of a linear relationship.

Features of application

The use of both indicators involves certain assumptions. First, the presence of a strong relationship does not establish that one quantity determines the other: there may well be a third quantity that determines each of them. Second, a high Pearson correlation coefficient does not indicate a cause-and-effect relationship between the studied variables. Third, it reflects exclusively linear dependence. Correlation can be used to evaluate meaningful quantitative data (e.g., barometric pressure, air temperature) rather than categories such as gender or favorite color.

Multiple correlation coefficient

Pearson and Spearman examined the relationship between two variables. But what if there are three or even more? This is where the multiple correlation coefficient comes to the rescue. For example, gross national product is influenced not only by foreign direct investment but also by the state's monetary and fiscal policies, as well as by the level of exports. The growth rate and volume of GDP are the result of the interaction of a number of factors. It must be understood, however, that the multiple correlation model rests on a number of simplifications and assumptions: first, multicollinearity between the values is excluded; second, the relationship between the dependent variable and the variables influencing it is considered linear.

Areas of use of correlation and regression analysis

This method of finding relationships between quantities is widely used in statistics. It is most often resorted to in three main cases:

  1. To test cause-and-effect relationships between the values of two variables. As a result, the researcher hopes to discover a linear relationship and derive a formula describing the relationship between the quantities. Their units of measurement may differ.
  2. To check for a relationship between quantities. In this case, no one determines which variable is the dependent one; it may turn out that some other factor determines the values of both quantities.
  3. To derive an equation. In this case, numbers can simply be substituted into it to find the values of the unknown variable.

A man in search of a cause-and-effect relationship

Consciousness is arranged in such a way that we need to explain the events happening around us. A person always looks for a connection between the picture of the world in which he lives and the information he receives. The brain often creates order out of chaos: it can easily see a cause-and-effect relationship where there is none. Scientists have to learn specially to overcome this tendency. The ability to evaluate relationships between data objectively is essential in an academic career.

Media bias

Let us consider how the presence of a correlation can be misinterpreted. A group of British students with bad behavior were asked whether their parents smoked, and the study was then published in a newspaper. The result showed a strong correlation between parental smoking and their children's delinquency. The professor who conducted the study even suggested putting a warning about this on cigarette packs. However, there is a whole series of problems with this conclusion. First, correlation does not show which of the quantities is independent; it is therefore quite possible that the parents' harmful habit is caused by the children's disobedience. Second, it cannot be said with certainty that both problems did not arise from some third factor, for example low family income. Finally, the emotional aspect of the professor's initial findings is worth noting: he was an ardent opponent of smoking, so it is not surprising that he interpreted the results of his research in this way.

Conclusions

Misinterpreting a correlation as a cause-and-effect relationship between two variables can cause disgraceful research errors. The problem is that this tendency lies at the very basis of human consciousness, and many marketing tricks are based on it. Understanding the difference between causation and correlation allows you to analyze information rationally both in everyday life and in a professional career.

Correlation coefficient formula

In the course of human economic activity, a whole class of problems gradually took shape that required identifying various statistical patterns. It was necessary to assess the degree to which some processes are determined by others and to establish the closeness of the interdependence between different processes and variables.

Correlation is the relationship of variables to each other.

To assess the closeness of the relationship, the correlation coefficient was introduced.

Physical meaning of the correlation coefficient

The correlation coefficient has a clear physical meaning if the statistical parameters of the independent variables obey a normal distribution (graphically represented by a Gaussian curve) and the dependence is linear.

The correlation coefficient shows how strongly one process is determined by another, that is, how often the dependent process changes when the first one changes. If it does not change at all, there is no dependence; if it changes every time, the dependence is complete.

The correlation coefficient can take values in the range [−1; 1].

A zero value of the coefficient means that there is no relationship between the variables under consideration; the extreme values of the range indicate complete dependence between them.

If the coefficient value is positive, the relationship is direct; if negative, it is inverse. That is, in the first case the function changes proportionally with the argument, in the second inversely.

When the value of the correlation coefficient lies inside the range, i.e., between 0 and 1 or between −1 and 0, one speaks of an incomplete functional dependence. The closer the value is to the extremes, the stronger the relationship between the variables or random values; the closer it is to 0, the weaker the interdependence. Usually the correlation coefficient takes intermediate values.

The correlation coefficient is a dimensionless quantity.

The correlation coefficient is used in statistics, in correlation analysis, to test statistical hypotheses.

Having put forward a statistical hypothesis about the dependence of one random variable on another, one calculates the correlation coefficient. Based on it, a judgment can be made about whether there is a relationship between the quantities and how close it is.

The fact is that it is not always possible to see the relationship. Often quantities are not directly related to each other, but depend on many factors. However, it may turn out that through many indirect connections random variables turn out to be interdependent. Of course, this may not mean their direct connection; for example, if the intermediary disappears, the dependence may also disappear.

In Chapter 4, we looked at basic univariate descriptive statistics—measures of central tendency and variability that are used to describe a single variable. In this chapter we will look at the main correlation coefficients.

The correlation coefficient is a bivariate descriptive statistic, a quantitative measure of the relationship (joint variability) of two variables.

The history of the development and application of correlation coefficients for the study of relationships actually began simultaneously with the emergence of the measurement approach to the study of individual differences - in 1870-1880. The pioneer in measuring human abilities, as well as the author of the term “correlation coefficient” itself, was Francis Galton, and the most popular correlation coefficients were developed by his follower Karl Pearson. Since then, the study of relationships using correlation coefficients has been one of the most popular activities in psychology.

To date, a great variety of correlation coefficients have been developed, and hundreds of books are devoted to the problem of measuring relationships with their help. Therefore, without pretending to be exhaustive, we will consider only the most important, truly indispensable measures of relationship: Pearson's, Spearman's, and Kendall's.* Their common feature is that they reflect the relationship between two characteristics measured on a quantitative scale, rank or metric.

Generally speaking, any empirical research focuses on examining the relationships between two or more variables.

EXAMPLES

Let us give two examples of research into the effect of showing scenes of violence on TV on the aggressiveness of adolescents.

1. The relationship between two variables measured on a quantitative (rank or metric) scale is studied: 1) “time spent watching violent television programs”; 2) “aggressiveness.”

* The symbol τ is read “Kendall's tau.”


CHAPTER 6. CORRELATION COEFFICIENTS

2. The difference in aggressiveness between two or more groups of adolescents differing in the duration of viewing television programs with scenes of violence is studied.

In the second example, the study of differences can be presented as a study of the relationship between two variables, one of which is nominative (the duration of watching TV shows). For this situation, special correlation coefficients have also been developed.

Any research can be reduced to the study of correlations; fortunately, a variety of correlation coefficients have been invented for almost any research situation. But in the following presentation we will distinguish between two classes of problems:

  • the study of correlations, when both variables are presented on a numerical scale;
  • the study of differences, when at least one of the two variables is presented on a nominative scale.


This division also corresponds to the logic of popular statistical computer programs, in which the Correlations menu offers three coefficients (Pearson's r, Spearman's r, and Kendall's τ), while other research problems are addressed by methods of group comparison.

THE CONCEPT OF CORRELATION

Relationships are usually described in the language of mathematics using functions, which are graphically represented as lines. Figure 6.1 shows several graphs of functions. If a change in one variable by one unit always changes another variable by the same amount, the function is linear (its graph is a straight line); any other relationship is nonlinear. If an increase in one variable is associated with an increase in another, the relationship is positive (direct); if an increase in one variable is associated with a decrease in another, the relationship is negative (inverse). If the direction of change of one variable does not reverse as the other increases (or decreases), the function is monotonic; otherwise it is called non-monotonic.

Functional connections similar to those shown in Fig. 6.1 are idealizations. Their peculiarity is that one value of one variable corresponds to a strictly defined value of another variable. For example, this is the relationship between two physical variables - weight and body length (linear positive). However, even in physical experiments, the empirical relationship will differ from the functional relationship due to unaccounted for or unknown reasons: fluctuations in the composition of the material, measurement errors, etc.

Fig. 6.1. Examples of graphs of frequently occurring functions

In psychology, as in many other sciences, when the relationship between characteristics is studied, many possible causes of their variability inevitably escape the researcher's attention. The result is that even a functional relationship between variables that exists in reality appears empirically as probabilistic (stochastic): the same value of one variable corresponds to a distribution of different values of the other (and vice versa). The simplest example is the relationship between people's height and weight. Empirical study of these two characteristics will, of course, show a positive relationship, but it is easy to guess that it will differ from a strict, linear, positive, ideal mathematical function, despite all the researcher's efforts to take the subjects' slimness or stoutness into account. (It is unlikely that anyone would deny on this basis the existence of a strict functional relationship between body length and weight.)

So, in psychology, as in many other sciences, a functional relationship between phenomena can be identified empirically only as a probabilistic relationship of the corresponding characteristics. A clear idea of the nature of a probabilistic relationship is given by a scatter diagram: a graph whose axes correspond to the values of the two variables and on which each subject is represented by a point (Fig. 6.2). Correlation coefficients are used as a numerical characteristic of a probabilistic relationship.