Does outlier analysis affect correlation?

Correlation and dependence

Correlation is a measure of the relationship between two variables or measured data values, which includes the Pearson product-moment correlation coefficient as a special case. In statistics, dependence refers to any statistical relationship between two random variables or two sets of data.
Correlation refers to any of a broad class of statistical
relationships involving dependence.
Familiar examples of dependent phenomena include the correlation
between the physical statures of parents and their offspring, and the
correlation between the demand for a product and its price. Correlations
are useful because they can indicate a predictive relationship that
can be exploited in practice. For example, an electrical utility
may produce less power on a mild day based on the correlation
between electricity demand and weather. In this example there is a
causal relationship, because extreme weather causes
people to use more electricity for heating or cooling; however,
statistical dependence is not sufficient to demonstrate the
presence of such a causal relationship (i.e., correlation does not imply causation).
Formally, dependence refers to any situation in which random
variables do not satisfy a mathematical condition of probabilistic independence. In loose usage, correlation can refer to
any departure of two or more random variables from independence,
but technically it refers to any of several more specialized types
of relationship between mean values. There are several
correlation coefficients, often denoted ρ or
r, measuring the degree of correlation. The most common of
these is the Pearson correlation coefficient, which is sensitive only to a linear
relationship between two variables (which may exist even if one is
a nonlinear function of the other). Other correlation coefficients
have been developed to be more robust than the Pearson
correlation – that is, more sensitive to nonlinear
relationships.
The most familiar measure of dependence between two quantities
is the Pearson product-moment correlation coefficient, or "Pearson's
correlation." It is obtained by dividing the covariance
of the two variables by the product of their standard deviations.
Karl Pearson developed the coefficient from a
similar but slightly different idea by Francis Galton.
The population correlation coefficient ρX,Y
between two random variables X and Y with expected values μX and μY
and standard deviations σX and σY is defined as:
    ρX,Y = corr(X, Y) = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY),
where E is the expected value operator, cov means covariance,
and corr is a widely used alternative notation for Pearson's correlation.
The Pearson correlation is defined only if both of the standard
deviations are finite and both of them are nonzero. It is a
corollary of the Cauchy–Schwarz inequality
that the correlation cannot exceed 1 in absolute value. The correlation coefficient is
symmetric:
corr(X, Y) = corr(Y, X).
The Pearson correlation is +1 in the case of a perfect positive
(increasing) linear relationship (correlation), −1 in the case of a
perfect decreasing (negative) linear relationship
(anticorrelation),
and some value between −1 and 1 in all other cases, indicating the
degree of linear dependence between the
variables. As it approaches zero there is less of a relationship
(closer to uncorrelated). The closer the coefficient is to either
−1 or 1, the stronger the correlation between the variables.
If the variables are independent, Pearson's correlation
coefficient is 0, but the converse is not true because the
correlation coefficient detects only linear dependencies between
two variables. For example, suppose the random variable X is
symmetrically distributed about zero, and Y =
X². Then Y is completely determined by
X, so that X and Y are perfectly dependent,
but their correlation is zero; they are uncorrelated. However, in the special case when
X and Y are jointly normal,
uncorrelatedness is equivalent to independence.
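The point above can be checked numerically. A minimal pure-Python sketch (my own illustrative data, not from the article): X is symmetric about zero and Y = X², so the sample Pearson correlation comes out exactly zero even though Y is a deterministic function of X.

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [-2, -1, 0, 1, 2]        # symmetric about zero
y = [v ** 2 for v in x]      # completely determined by x

print(pearson(x, y))         # 0.0 (up to floating-point rounding)
```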
If we have a series of n measurements of X and
Y written as xi and yi
where i = 1, 2, ..., n, then the sample
correlation coefficient can be used to estimate the population
Pearson correlation ρ between X and Y. The
sample correlation coefficient is written
    rxy = Σ (xi − x̄)(yi − ȳ) / ((n − 1) sx sy)
        = Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² ),

with all sums running over i = 1, ..., n,
where x̄ and ȳ are the sample means
of X and Y, and
sx and sy are the sample standard deviations of X and Y.
This can also be written as:
    rxy = (Σ xiyi − n x̄ ȳ) / ((n − 1) sx sy)
        = (n Σ xiyi − Σ xi Σ yi) / ( √(n Σ xi² − (Σ xi)²) · √(n Σ yi² − (Σ yi)²) ).
If x and y are results of measurements that
contain measurement error, the realistic limits on the correlation
coefficient are not −1 to +1 but a smaller range.
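As a quick numerical check (illustrative data of my own, not from the text), the two algebraically equivalent forms of the sample correlation agree:

```python
def r_deviation_form(x, y):
    """First form: sums of deviations from the sample means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def r_computational_form(x, y):
    """Second form: raw sums only, no pre-computed means."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        ((n * sxx - sx * sx) ** 0.5) * ((n * syy - sy * sy) ** 0.5))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.8]   # nearly linear in x

print(r_deviation_form(x, y), r_computational_form(x, y))
```

In exact arithmetic the two are identical; in floating point the deviation form is usually the numerically safer choice.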
Rank correlation coefficients
Main articles: Spearman's rank correlation coefficient, Kendall tau rank correlation coefficient
Rank correlation coefficients, such as Spearman's and Kendall's,
measure the extent to which, as one variable increases, the other
variable tends to increase, without requiring that increase to be
represented by a linear relationship. If, as one variable
increases, the other decreases, the rank correlation
coefficients will be negative. It is common to regard these rank
correlation coefficients as alternatives to Pearson's coefficient,
used either to reduce the amount of calculation or to make the
coefficient less sensitive to non-normality in distributions.
However, this view has little mathematical basis, as rank
correlation coefficients measure a different type of relationship
than the Pearson product-moment correlation coefficient, and are best seen as
measures of a different type of association, rather than as
an alternative measure of the population correlation
coefficient.
To illustrate the nature of rank correlation, and its difference
from linear correlation, consider the following four pairs of
numbers (x, y):
(0, 1), (10, 100),
(101, 500), (102, 2000).
As we go from each pair to the next pair x increases, and
so does y. This relationship is perfect, in the sense that
an increase in x is always accompanied by an increase
in y. This means that we have a perfect
rank correlation, and both Spearman's and Kendall's correlation
coefficients are 1, whereas in this example Pearson product-moment
correlation coefficient is 0.7544, indicating that the points are
far from lying on a straight line. In the same way, if y
always decreases when x increases, the rank
correlation coefficients will be −1, while the Pearson
product-moment correlation coefficient may or may not be close to
−1, depending on how close the points are to a straight line.
Although in the extreme cases of perfect rank correlation the two
coefficients are both equal (being both +1 or both −1), this is not
in general so, and values of the two coefficients cannot
meaningfully be compared.
For example, for the three pairs (1, 1),
(2, 3), (3, 2), Spearman's
coefficient is 1/2, while Kendall's coefficient is 1/3.
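Both worked examples above can be verified with a small pure-Python sketch (it assumes no tied values, which holds for these datasets):

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(v):
    """1-based ranks; assumes no ties."""
    order = sorted(v)
    return [order.index(a) + 1 for a in v]

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            s += 1 if (x[i] - x[j]) * (y[i] - y[j]) > 0 else -1
    return s / (n * (n - 1) / 2)

x4, y4 = [0, 10, 101, 102], [1, 100, 500, 2000]
print(round(pearson(x4, y4), 4))          # 0.7544
print(spearman(x4, y4), kendall(x4, y4))  # 1.0 1.0

x3, y3 = [1, 2, 3], [1, 3, 2]
print(spearman(x3, y3), kendall(x3, y3))  # 0.5 and 1/3
```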
The degree of dependence between variables X and Y
does not depend on the scale on which the variables are expressed.
That is, if we are analyzing the relationship between X and
Y, most correlation measures are unaffected by transforming
X to a + bX and Y to c + dY, where
a, b, c, and d are constants (b and d being positive). This is
true of some correlation statistics as well as their population
analogues. Some correlation statistics, such as the rank
correlation coefficient, are also invariant to monotone transformations
of the marginal
distributions of X and/or Y.
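The scale invariance described above is easy to confirm numerically. A minimal sketch with made-up data, applying X → a + bX and Y → c + dY with b, d > 0:

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

x2 = [10 + 3 * v for v in x]    # a + b*x with b = 3 > 0
y2 = [-7 + 2 * v for v in y]    # c + d*y with d = 2 > 0

print(pearson(x, y), pearson(x2, y2))   # identical values
```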
(In the original article, a figure shows Pearson correlation coefficients between X and Y
when the two variables' ranges are unrestricted, and when the range
of X is restricted to the interval (0,1).)
Most correlation measures are sensitive to the manner in which
X and Y are sampled. Dependencies tend to be stronger
if viewed over a wider range of values. Thus, if we consider the
correlation coefficient between the heights of fathers and their
sons over all adult males, and compare it to the same correlation
coefficient calculated when the fathers are selected to be between
165 cm and 170 cm in height, the
correlation will be weaker in the latter case. Several techniques
have been developed that attempt to correct for range restriction
in one or both variables, and are commonly used in meta-analysis;
the most common are Thorndike's case II and case III
equations.
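The range-restriction effect can be shown with a small deterministic sketch (entirely made-up father/son heights, not real data): the same underlying relationship yields a weaker correlation once the x-range is restricted.

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

fathers = list(range(160, 181, 2))                       # 160, 162, ..., 180 cm
noise = [2 if i % 2 == 0 else -2 for i in range(len(fathers))]
sons = [f + e for f, e in zip(fathers, noise)]           # height + fixed scatter

r_full = pearson(fathers, sons)

# restrict the sample to fathers between 165 cm and 173 cm
sub = [(f, s) for f, s in zip(fathers, sons) if 165 <= f <= 173]
r_sub = pearson([f for f, _ in sub], [s for _, s in sub])

print(r_full, r_sub)   # the restricted-range correlation is smaller
```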
Various correlation measures in use may be undefined for certain
joint distributions of X and Y. For example, the
Pearson correlation coefficient is defined in terms of moments, and hence will be undefined if
the moments are undefined. Measures of dependence based on
quantiles are always defined. Sample-based
statistics intended to estimate population measures of dependence
may or may not have desirable statistical properties, such as being
unbiased or asymptotically consistent, based on the
spatial structure of the population from which the data were sampled.
Sensitivity to the data distribution can be used to an
advantage. For example, scaled correlation
is designed to use the
sensitivity to the range in order to pick out correlations between
fast components of time series.
By reducing the range of values in a controlled manner, the
correlations on long time scale are filtered out and only the
correlations on short time scales are revealed.
Correlation matrices
The correlation matrix of n random variables
X1, ..., Xn is the
n × n matrix
whose i,j entry is
corr(Xi, Xj).
If the measures of correlation used are product-moment
coefficients, the correlation matrix is the same as the
covariance matrix of the standardized random variables
Xi / σ(Xi) for
i = 1, ..., n. This
applies to both the matrix of population correlations (in which
case "σ" is the population standard deviation), and to the matrix
of sample correlations (in which case "σ" denotes the sample
standard deviation). Consequently, each is necessarily a positive-semidefinite matrix.
The correlation matrix is symmetric because the correlation
between Xi and Xj
is the same as the correlation between Xj and Xi.
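A small sketch (with made-up sample data) builds such a matrix and checks the two properties just stated, symmetry and a unit diagonal:

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],      # X1
    [2.0, 4.1, 5.9, 8.2, 9.9],      # X2, nearly proportional to X1
    [5.0, 3.0, 4.0, 1.0, 2.0],      # X3
]

n = len(data)
corr = [[pearson(data[i], data[j]) for j in range(n)] for i in range(n)]

for row in corr:
    print([round(v, 3) for v in row])
```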
Correlation and linearity
Four sets of data with the same correlation of 0.816
The Pearson correlation coefficient indicates the strength of a
linear relationship between two variables, but its value generally
does not completely characterize their relationship. In particular,
if the conditional mean of Y given
X, denoted E(Y|X), is not linear in X,
the correlation coefficient will not fully determine the form of
E(Y|X). The image on the right shows
scatter plots of Anscombe's quartet, a set of four
different pairs of variables created by Francis Anscombe.
The four y variables have the same mean (7.5), standard
deviation (4.12), correlation (0.816) and regression line
(y = 3 + 0.5x).
However, as can be seen on the plots, the distribution of the
variables is very different. The first one (top left) seems to be
distributed normally, and corresponds to what one would expect when
considering two variables correlated and following the assumption
of normality. The second one (top right) is not distributed
normally; while an obvious relationship between the two variables
can be observed, it is not linear. In this case the Pearson
correlation coefficient does not indicate that there is an exact
functional relationship: only the extent to which that relationship
can be approximated by a linear relationship. In the third case
(bottom left), the linear relationship is perfect, except for one
outlier, which exerts enough influence to lower the
correlation coefficient from 1 to 0.816. Finally, the fourth
example (bottom right) shows another example when one outlier is
enough to produce a high correlation coefficient, even though the
relationship between the two variables is not linear.
These examples indicate that the correlation coefficient, as a
summary statistic, cannot replace visual examination of the data.
Note that the examples are sometimes said to demonstrate that the
Pearson correlation assumes that the data follow a normal distribution, but this is not correct.
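The quartet's equal correlations can be checked directly; the sketch below uses the standard published Anscombe (1973) values (the article's figure itself is not reproduced here):

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Anscombe's quartet: datasets 1-3 share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

datasets = [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]
for x, y in datasets:
    print(round(pearson(x, y), 3))   # about 0.816 in all four cases
```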
Source: Wikipedia
Correlation
Linear Correlation
Introduction
The news is filled with examples of correlations and associations:
Drinking a glass of red wine per day may decrease your chances of
a heart attack.
Taking one aspirin per day may decrease your chances of stroke or
of a heart attack.
Eating lots of certain kinds of fish may improve your health and
make you smarter.
Driving slower reduces your chances of getting killed in a traffic accident.
Taller people tend to weigh more.
Pregnant women that smoke tend to have low birthweight babies.
Animals with large brains tend to be more intelligent.
The more you study for an exam, the higher the score you are likely
to receive.
The correlation, denoted by r, measures the amount of linear
association between two variables.
r is always between -1 and 1 inclusive.
The R-squared value, denoted by R2,
is the square of the correlation.
It measures the proportion of
variation in the dependent variable that can be attributed to
the independent variable.
The R-squared value R2 is always between
0 and 1 inclusive.
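The "proportion of variation" interpretation can be verified directly: with toy data of my own, R-squared equals both r² and the share of y-variation explained by the least-squares line.

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 2.5, 4.0, 4.5, 6.0]

# least-squares line y = a0 + b*x
n = len(x)
mx, my = sum(x) / n, sum(y) / n
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)
a0 = my - b * mx
yhat = [a0 + b * v for v in x]

sst = sum((c - my) ** 2 for c in y)     # total variation in y
ssr = sum((h - my) ** 2 for h in yhat)  # variation explained by the line

r = pearson(x, y)
print(r ** 2, ssr / sst)   # the two values coincide
```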
Examples (captions of the scatterplots in the original page):
The points are exactly on the trend line: correlation r = 1; R-squared = 1.00.
The points are close to the linear trend line: correlation r = 0.9; R-squared = 0.81.
The points are far from the trend line: correlation r = 0.45; R-squared = 0.2025.
There is no association between the variables: correlation r = 0.0; R-squared = 0.0.
Correlation r = -0.3; R-squared = 0.09.
Correlation r = -0.95; R-squared = 0.9025.
Correlation r = -1; R-squared = 1.00.
How high must a correlation be to be considered meaningful?
It depends on the discipline.
Here are some rough guidelines relating the size of r to the strength of association
between the variables. (The original table of guideline values is not reproduced here.)
There is a perfect quadratic relationship between
x and y, but the correlation is 0.
A quadratic relationship between
x and y means that there is an equation y = ax² + bx + c that
allows us to compute y from x; a, b, and c must be determined from the data.
Caution: Outliers can distort the correlation:
Without the outlier, the correlation is 1;
with the outlier the correlation is 0.514.
Without the outlier, the correlation is 1;
with the outlier the correlation is 0.522.
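The caution above can be reproduced in miniature (with my own numbers, not the figures' data): collinear points give r = 1, and a single added outlier drags the correlation well below 1.

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]      # exactly on a line: r = 1

x_out = x + [10]          # one added outlier
y_out = y + [0]

print(pearson(x, y), pearson(x_out, y_out))
```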
Use SPSS to continue the above analysis:
1. Compute the correlation between meter and kilo.
2. Create a scatterplot with a linear regression line
(linear trend line) of meter (x-variable) and kilo (y-variable).
3. Repeat steps 1 and 2 after omitting the point that represents William Perry.