Reading Notes on "Python for Data Analysis": the Global Interpreter Lock, NumPy Efficiency, and Conda

I. What Is the Global Interpreter Lock

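Python can be a challenging language for building highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads.
The reason is that Python has a global interpreter lock (GIL), a mechanism that prevents the interpreter from executing more than one Python instruction at a time.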

The passage above raises a question: what exactly is the global interpreter lock (GIL)?

The official Global Interpreter Lock page explains:

In CPython, the global interpreter lock, or GIL, is a mutex that protects access to Python objects, preventing multiple threads from executing Python bytecodes at once.
This lock is necessary mainly because CPython’s memory management is not thread-safe.

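There are other interpreted languages, such as Ruby and JavaScript, and I first assumed a global lock like CPython's was unique. It is not: Ruby's reference implementation (MRI) has a comparable Global VM Lock, while JavaScript engines generally run script on a single thread per context and so face the problem in a different form.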

The GIL has quite a few drawbacks, for example:

The GIL is controversial because it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations.

Two threads calling a function may take twice as much time as a single thread calling the function twice.

The GIL can cause I/O-bound threads to be scheduled ahead of CPU-bound threads. And it prevents signals from being delivered.

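So far I have only used Python casually, for study and experiments. Since this problem is so notorious, I wondered whether it still exists in the just-released Python 3.8.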

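I came across a good article, "Hello everyone, let me introduce the Python GIL" (original title in Chinese), which reproduces the problem scenario by timing a CPU-bound function run twice in one thread against the same function run in two threads.
I reproduced the problem on Python 2.7, but on Python 3.8 the two versions finished in nearly the same time. That does not mean the GIL is gone: it is still present in CPython 3.8. A likely explanation is the GIL rewrite in Python 3.2, which reduced the cost of threads fighting over the lock, so two threads no longer run noticeably slower than one, but they still do not run faster.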
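
To make the experiment concrete, here is a minimal sketch of that kind of benchmark for Python 3. It is not the code from the article; the function names `countdown`, `time_sequential`, and `time_threaded` and the loop size are my own assumptions for illustration.

```python
import threading
import time


def countdown(n):
    """Pure-Python CPU-bound loop; the thread running it needs the GIL."""
    while n > 0:
        n -= 1
    return n


def time_sequential(n):
    """Run countdown twice, one after the other, in the calling thread."""
    start = time.perf_counter()
    countdown(n)
    countdown(n)
    return time.perf_counter() - start


def time_threaded(n):
    """Run countdown once in each of two threads and wait for both."""
    threads = [threading.Thread(target=countdown, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 5_000_000  # arbitrary size, large enough to take a measurable time
    print(f"sequential:  {time_sequential(n):.2f}s")
    print(f"two threads: {time_threaded(n):.2f}s")
```

If the GIL did not serialize bytecode execution, the two-thread version could approach half the sequential time on a multicore machine. Under the GIL it should not be expected to beat the sequential run; the exact numbers depend on the interpreter version and the machine.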

For more on the GIL, see the following links:

II. NumPy's Storage Efficiency

For numerical data, NumPy arrays store and manipulate data more efficiently than Python's built-in data structures.

What is the reasoning behind this claim? I am not the only one asking; I noticed similar questions online: Memory Efficiency of NumPy

What makes Numpy Arrays Fast: Memory and Strides
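
One way to see the storage difference is to measure it. The sketch below is a rough estimate under my own assumptions: the helper `list_footprint` is hypothetical, it counts the list object plus every boxed integer it refers to, and it ignores details such as CPython's cache of small integers.

```python
import sys

import numpy as np


def list_footprint(values):
    """Approximate bytes for a list: the pointer array plus each int object."""
    return sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)


n = 100_000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

list_bytes = list_footprint(py_list)
array_bytes = np_array.nbytes  # raw data only: n elements * 8 bytes each

print(f"list : about {list_bytes} bytes")
print(f"array: {array_bytes} bytes of element data")
```

A list stores pointers to separately allocated Python objects, while the array stores the 64-bit values themselves in one contiguous block, which is where most of the difference comes from.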

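Libraries written in a lower-level language can operate directly on the data stored in a NumPy array, without first copying it into some other memory representation.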

This sounds somewhat like JNI, where data has to be passed between Java memory and native memory. The article Understanding the internals of NumPy to avoid unnecessary array copying gives a rough explanation.
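
A small sketch of what "no copying" can look like from Python, assuming NumPy is installed. The variable names are mine, and this only illustrates shared memory and strides; it is not a summary of the linked article.

```python
import numpy as np

arr = np.arange(6, dtype=np.int32)

# The buffer protocol hands out the array's memory without copying it.
exported = memoryview(arr)
arr[0] = 42
first_seen_through_buffer = exported[0]  # reads the same memory as arr[0]

# Basic slicing returns a view: same buffer, different strides.
every_other = arr[::2]
shares_with_slice = np.shares_memory(arr, every_other)

# Fancy (integer-list) indexing returns a copy with its own buffer.
picked = arr[[0, 2, 4]]
shares_with_fancy = np.shares_memory(arr, picked)

# For int32 data the strides are (4,) for arr and (8,) for every_other.
print(arr.strides, every_other.strides)
print(shares_with_slice, shares_with_fancy)
```

The strides say how many bytes to step to reach the next element, so a view can describe "every other element" of the original buffer without moving any data.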

Unfortunately, even after reading these materials I am still in a fog. I will mark this as a TODO and dig deeper when I have the chance.

III. Conda vs. pip

Section 1.4.3 mentions installing libraries through Anaconda. So what are Anaconda and conda? What problem do they solve, and how do they solve it?

On its official Twitter account, Anaconda introduces itself like this:

Anaconda is the world’s most popular and trusted Python/R platform for data science, machine learning, and AI.

The Conda page, in turn, says:

Conda is an open source package management system and environment management system that runs on Windows, macOS and Linux.
Conda quickly installs, runs and updates packages and their dependencies. Conda easily creates, saves, loads and switches between environments on your local computer.
It was created for Python programs, but it can package and distribute software for any language.
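
To see which environment a given interpreter belongs to, a sketch like the following can help. The function name `describe_environment` is hypothetical; the checks rely on the `CONDA_DEFAULT_ENV` variable that conda sets on activation and on the `conda-meta` directory that conda environments contain, so treat the result as a heuristic.

```python
import os
import sys
from pathlib import Path


def describe_environment():
    """Collect a few hints about the environment running this interpreter."""
    prefix = Path(sys.prefix)
    return {
        "prefix": str(prefix),
        # Set by `conda activate`; None when no conda environment is active.
        "conda_env_name": os.environ.get("CONDA_DEFAULT_ENV"),
        # Conda environments keep package metadata in <prefix>/conda-meta.
        "looks_like_conda": (prefix / "conda-meta").is_dir(),
        # In a venv, sys.prefix differs from sys.base_prefix.
        "is_venv": sys.prefix != sys.base_prefix,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```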

It is also worth noting how the two are related:

The conda package and environment manager is included in all versions of Anaconda and Miniconda.

An answer on Stack Overflow puts it very well, in my opinion:

conda is the package manager. Anaconda is a set of about a hundred packages including conda, numpy, scipy, ipython notebook, and so on.

These passages prompt two thoughts:

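First, the Python community already had pip (a recursive acronym, usually read as "Pip Installs Packages"). What shortcomings did it have that led to conda?
For this question, see the official explanation, Understanding Conda and Pip.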

Second, the Conda page carries this line:

Package, dependency and environment management for any language

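By extension, think of Maven for Java, npm for JavaScript, and even apt for Debian/Ubuntu:
managing packages (components, libraries) and their dependencies is indispensable for any programming language.
Managing the development environment is likewise something software engineering has to take into account.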

For more references on this section, see:
