A useful Python visualization library, yellowbrick: Anscombe's Quartet

Published: 2021-12-03 (public article)

Background

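When I started learning sklearn, I had to get over the hurdle of the algorithms and also learn matplotlib for visualization. For my practical work, visualization matters more, but matplotlib's ease of use and looks are honestly nothing to brag about. Along the way I tried plotly and seaborn, and finally settled on Bokeh, because it integrates nicely with Flask and makes building data dashboards much easier.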

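A while ago I saw that yellowbrick makes data exploration fairly convenient, and today I finally have time to study it. I first went to the English documentation, then found that someone is already working on a Chinese translation. It reads like Google Translate, but in the spirit of taking what is available and saving some effort, I half copied and half learned from it, and still found a few places that do not match the documentation.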

 

# http://www.scikit-yb.org/zh/latest/api/anscombe.html

About Anscombe's quartet

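In 1973, the statistician F. J. Anscombe constructed four peculiar datasets. They show how important it is to plot data before analyzing it. In all four datasets the mean of x is 9.0 and the mean of y is about 7.5; the variance of x is 10.0 and the variance of y is about 3.75 (both are population variances, dividing by n); the correlation is about 0.816, and the linear regression line is approximately y = 3 + 0.5x. Judging by these statistics alone, the four datasets look nearly identical, yet in reality they are worlds apart.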

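Plot them and you will see four completely different situations. The first dataset is what most people picture when they read the statistics above; it is the most "normal" of the four. The second dataset is actually an exact quadratic relationship, and its statistics only happen to match the first dataset because a linear model was wrongly applied to it. The third dataset is an exact linear relationship with a single outlier, which skews the statistics above, the correlation in particular. The fourth dataset is an even more extreme example, where the outlier skews all of the statistics: mean, variance, correlation, and regression line.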
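The numbers quoted above are easy to check with NumPy. The following is only a quick sketch, not part of yellowbrick: `summarize` is a helper name made up here, and it assumes population variance (`ddof=0`, NumPy's default), which is what gives 10.0 and roughly 3.75.

```python
import numpy as np

# The four Anscombe datasets, row 0 is x and row 1 is y
ANSCOMBE = [
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    ]),
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    ]),
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
    ]),
    np.array([
        [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
    ]),
]


def summarize(arr):
    """Return the summary statistics discussed in the text (hypothetical helper)."""
    x, y = arr
    # Degree-1 least-squares fit: polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(x, y, 1)
    return {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(),  # population variance (ddof=0)
        "var_y": y.var(),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }


for i, arr in enumerate(ANSCOMBE, start=1):
    s = summarize(arr)
    print(
        f"dataset {i}: mean=({s['mean_x']:.2f}, {s['mean_y']:.2f}) "
        f"var=({s['var_x']:.2f}, {s['var_y']:.2f}) r={s['corr']:.3f} "
        f"fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x"
    )
```

Switching to sample variance (`x.var(ddof=1)`) gives 11.0 for x instead, which is the figure some other write-ups quote.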

import numpy as np
import matplotlib.pyplot as plt

from yellowbrick.bestfit import draw_best_fit
from yellowbrick.style import get_color_cycle


##########################################################################
## Anscombe Data Arrays
##########################################################################

ANSCOMBE = [
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
    ]),
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
    ]),
    np.array([
        [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0],
        [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
    ]),
    np.array([
        [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]
    ])
]


def anscombe():
    """
    Creates 2x2 grid plot of the 4 anscombe datasets for illustration.
    """
    fig, ((axa, axb), (axc, axd)) = plt.subplots(2, 2, sharex='col', sharey='row')
    colors = get_color_cycle()
    for arr, ax, color in zip(ANSCOMBE, (axa, axb, axc, axd), colors):
        x = arr[0]
        y = arr[1]

        # Set the X and Y limits
        ax.set_xlim(0, 15)
        ax.set_ylim(0, 15)

        # Draw the points in the scatter plot
        ax.scatter(x, y, c=color)

        # Draw the linear best fit line on the plot
        draw_best_fit(x, y, ax, c=color)

    return (axa, axb, axc, axd)


anscombe()
plt.show()
Running this prints the following warning four times, once per subplot:
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
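The warning comes from `ax.scatter(x, y, c=color)`: `color` is a bare RGB tuple, and matplotlib cannot tell whether it is one color or a sequence of values to map. Below is a small sketch of two ways to pass a single color unambiguously, using matplotlib only. The `Agg` backend and the variable names are my own choices for illustration, and the RGB value is the first tuple that `get_color_cycle()` returns in this article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# First color from the get_color_cycle() output in this article
rgb = (0.00784313725490196, 0.4470588235294118, 0.6352941176470588)

# First Anscombe dataset
x = np.array([10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0])
y = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])

fig, (ax_kw, ax_2d) = plt.subplots(1, 2)

# Option 1: the `color` keyword is always read as a color, never as values to map
pts_kw = ax_kw.scatter(x, y, color=rgb)

# Option 2: what the warning itself suggests, a 2-D array with a single row
pts_2d = ax_2d.scatter(x, y, c=[rgb])

# Both collections end up with the same single RGBA face color
print(pts_kw.get_facecolor())
print(pts_2d.get_facecolor())
```

In the `anscombe()` loop, changing `c=color` to `color=color` in the `ax.scatter` call should be enough to silence the warning; I have left the original listing as it is in the documentation.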
 
 
get_color_cycle()  # wow, this color scheme is pretty slick too!
[(0.00784313725490196, 0.4470588235294118, 0.6352941176470588),
 (0.6235294117647059, 0.7647058823529411, 0.4666666666666667),
 (0.792156862745098, 0.043137254901960784, 0.011764705882352941),
 (0.6470588235294118, 0.00784313725490196, 0.34509803921568627),
 (0.8431372549019608, 0.7803921568627451, 0.011764705882352941),
 (0.5333333333333333, 0.792156862745098, 0.8549019607843137)]
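Float tuples like these are hard to read at a glance. As a small aside, here is a pure-Python sketch that turns them into hex codes, which are easier to paste into CSS or another plotting library. `rgb_to_hex` is a helper written for this post, not a yellowbrick function, and the `palette` list is simply copied from the output above.

```python
# Palette copied from the get_color_cycle() output above
palette = [
    (0.00784313725490196, 0.4470588235294118, 0.6352941176470588),
    (0.6235294117647059, 0.7647058823529411, 0.4666666666666667),
    (0.792156862745098, 0.043137254901960784, 0.011764705882352941),
    (0.6470588235294118, 0.00784313725490196, 0.34509803921568627),
    (0.8431372549019608, 0.7803921568627451, 0.011764705882352941),
    (0.5333333333333333, 0.792156862745098, 0.8549019607843137),
]


def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of floats in [0, 1] to a '#rrggbb' string."""
    # Scale each channel to 0-255, round to the nearest integer, format as two hex digits
    return "#" + "".join(f"{round(channel * 255):02x}" for channel in rgb)


hex_codes = [rgb_to_hex(color) for color in palette]
print(hex_codes)
```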