Feature extraction


Module Sklearn.Feature_extraction.DictVectorizer wraps Python class sklearn.feature_extraction.DictVectorizer.

type t


constructor and attributes create
val create :
  ?dtype:Np.Dtype.t ->
  ?separator:string ->
  ?sparse:bool ->
  ?sort:bool ->
  unit ->

Transforms lists of feature-value mappings to vectors.

This transformer turns lists of mappings (dict-like objects) of feature names to feature values into Numpy arrays or scipy.sparse matrices for use with scikit-learn estimators.

When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on. For instance, a feature 'f' that can take on the values 'ham' and 'spam' will become two features in the output, one signifying 'f=ham', the other 'f=spam'.

However, note that this transformer will only do a binary one-hot encoding when feature values are of type string. If categorical features are represented as numeric values such as int, the DictVectorizer can be followed by :class:sklearn.preprocessing.OneHotEncoder to complete binary one-hot encoding.

Features that do not occur in a sample (mapping) will have a zero value in the resulting array/matrix.

Read more in the :ref:User Guide <dict_feature_extraction>.
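
The coding rules described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the helper name `vectorize_dicts` is hypothetical, and the output is a dense list of lists rather than a Numpy array or sparse matrix.

```python
def vectorize_dicts(samples, separator="="):
    """Toy dense version of the mapping-to-vector coding described above."""
    # Collect feature names; string values become 'name<sep>value' features.
    names = set()
    for sample in samples:
        for key, value in sample.items():
            names.add(f"{key}{separator}{value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    vocabulary = {name: index for index, name in enumerate(feature_names)}

    rows = []
    for sample in samples:
        row = [0.0] * len(feature_names)  # features absent from a sample stay at zero
        for key, value in sample.items():
            if isinstance(value, str):
                row[vocabulary[f"{key}{separator}{value}"]] = 1.0
            else:
                row[vocabulary[key]] = float(value)
        rows.append(row)
    return feature_names, rows


feature_names, rows = vectorize_dicts([{"f": "ham", "n": 2}, {"f": "spam"}])
print(feature_names)  # ['f=ham', 'f=spam', 'n']
print(rows)           # [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
```

Numeric values are copied through unchanged, which is why integer-coded categories need a separate one-hot step.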


  • dtype : dtype, default=np.float64 The type of feature values. Passed to Numpy array/scipy.sparse matrix constructors as the dtype argument.

  • separator : str, default='=' Separator string used when constructing new features for one-hot coding.

  • sparse : bool, default=True Whether transform should produce scipy.sparse matrices.

  • sort : bool, default=True Whether feature_names_ and vocabulary_ should be sorted when fitting.


  • vocabulary_ : dict A dictionary mapping feature names to feature indices.

  • feature_names_ : list A list of length n_features containing the feature names (e.g., 'f=ham' and 'f=spam').


>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> X
array([[2., 0., 1.],
       [0., 1., 3.]])
>>> v.inverse_transform(X) == [{'bar': 2.0, 'foo': 1.0}, {'baz': 1.0, 'foo': 3.0}]
True
>>> v.transform({'foo': 4, 'unseen_feature': 3})
array([[0., 0., 4.]])

See also

  • FeatureHasher : performs vectorization using only a hash function.

  • sklearn.preprocessing.OrdinalEncoder : handles nominal/categorical features encoded as columns of arbitrary data types.


method fit
val fit :
  ?y:Py.Object.t ->
  x:Py.Object.t ->
  [> tag] Obj.t ->

Learn a list of feature name -> indices mappings.


  • X : Mapping or iterable over Mappings Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).

  • y : (ignored)




method fit_transform
val fit_transform :
  ?y:Py.Object.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Learn a list of feature name -> indices mappings and transform X.

Like fit(X) followed by transform(X), but does not require materializing X in memory.


  • X : Mapping or iterable over Mappings Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).

  • y : (ignored)


  • Xa : {array, sparse matrix} Feature vectors; always 2-d.


method get_feature_names
val get_feature_names :
  [> tag] Obj.t ->

Returns a list of feature names, ordered by their indices.

If one-of-K coding is applied to categorical features, this will include the constructed feature names but not the original ones.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method inverse_transform
val inverse_transform :
  ?dict_type:Np.Dtype.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Transform array or sparse matrix X back to feature mappings.

X must have been produced by this DictVectorizer's transform or fit_transform method; it may only have passed through transformers that preserve the number of features and their order.

In the case of one-hot/one-of-K coding, the constructed feature names and values are returned rather than the original ones.


  • X : {array-like, sparse matrix} of shape (n_samples, n_features) Sample matrix.

  • dict_type : type, default=dict Constructor for feature mappings. Must conform to the collections.abc.Mapping API.


  • D : list of dict_type objects of shape (n_samples,) Feature mappings for the samples in X.


method restrict
val restrict :
  ?indices:bool ->
  support:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Restrict the features to those in support using feature selection.

This function modifies the estimator in-place.


  • support : array-like Boolean mask or list of indices (as returned by the get_support member of feature selectors).

  • indices : bool, default=False Whether support is a list of indices.




>>> from sklearn.feature_extraction import DictVectorizer
>>> from sklearn.feature_selection import SelectKBest, chi2
>>> v = DictVectorizer()
>>> D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
>>> X = v.fit_transform(D)
>>> support = SelectKBest(chi2, k=2).fit(X, [0, 1])
>>> v.get_feature_names()
['bar', 'baz', 'foo']
>>> v.restrict(support.get_support())
>>> v.get_feature_names()
['bar', 'foo']
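
The effect of restrict on the feature names and vocabulary can be sketched as follows. The helper `restrict_features` is hypothetical; unlike the real method it returns new values instead of modifying an estimator in place.

```python
def restrict_features(feature_names, support, indices=False):
    """Keep only the supported features and rebuild the name -> index mapping."""
    if indices:
        keep = list(support)
    else:
        # Boolean mask: keep the positions whose flag is true.
        keep = [i for i, flag in enumerate(support) if flag]
    kept_names = [feature_names[i] for i in keep]
    vocabulary = {name: new_index for new_index, name in enumerate(kept_names)}
    return kept_names, vocabulary


names, vocab = restrict_features(["bar", "baz", "foo"], [True, False, True])
print(names)  # ['bar', 'foo']
print(vocab)  # {'bar': 0, 'foo': 1}
```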


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.


method transform
val transform :
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform feature->value dicts to array or sparse matrix.

Named features not encountered during fit or fit_transform will be silently ignored.


  • X : Mapping or iterable over Mappings of shape (n_samples,) Dict(s) or Mapping(s) from feature names (arbitrary Python objects) to feature values (strings or convertible to dtype).


  • Xa : {array, sparse matrix} Feature vectors; always 2-d.


attribute vocabulary_
val vocabulary_ : t -> Dict.t
val vocabulary_opt : t -> (Dict.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute feature_names_
val feature_names_ : t -> [>`ArrayLike] Np.Obj.t
val feature_names_opt : t -> ([>`ArrayLike] Np.Obj.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


Module Sklearn.Feature_extraction.FeatureHasher wraps Python class sklearn.feature_extraction.FeatureHasher.

type t


constructor and attributes create
val create :
  ?n_features:int ->
  ?input_type:[`Dict | `Pair] ->
  ?dtype:Np.Dtype.t ->
  ?alternate_sign:bool ->
  unit ->

Implements feature hashing, aka the hashing trick.

This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3.

Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done. Feature values must be (finite) numbers.

This class is a low-memory alternative to DictVectorizer and CountVectorizer, intended for large-scale (online) learning and situations where memory is tight, e.g. when running prediction code on embedded devices.

Read more in the :ref:User Guide <feature_hashing>.

.. versionadded:: 0.13


  • n_features : int, default=2**20 The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.

  • input_type : {'dict', 'pair', 'string'}, default='dict' Either 'dict' (the default) to accept dictionaries over (feature_name, value); 'pair' to accept pairs of (feature_name, value); or 'string' to accept single strings. feature_name should be a string, while value should be a number. In the case of 'string', a value of 1 is implied. The feature_name is hashed to find the appropriate column for the feature. The value's sign might be flipped in the output (but see alternate_sign, below).

  • dtype : numpy dtype, default=np.float64 The type of feature values. Passed to scipy.sparse matrix constructors as the dtype argument. Do not set this to bool, np.boolean or any unsigned integer type.

  • alternate_sign : bool, default=True When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.

.. versionchanged:: 0.19 alternate_sign replaces the now deprecated non_negative parameter.
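
The hashing trick itself can be sketched with the standard library. This is an illustration only: MD5 from `hashlib` is swapped in for MurmurHash3, so the columns will not match those chosen by FeatureHasher, and the helper `hash_features` is hypothetical and returns dense rows instead of a sparse matrix.

```python
import hashlib


def hash_features(samples, n_features=10, alternate_sign=True):
    """Toy hashing trick over dict samples; returns dense rows."""
    rows = []
    for sample in samples:
        row = [0.0] * n_features
        for name, value in sample.items():
            digest = hashlib.md5(name.encode("utf-8")).digest()
            # Assumption: a signed 32-bit slice of MD5 stands in for MurmurHash3.
            h = int.from_bytes(digest[:4], "little", signed=True)
            column = abs(h) % n_features
            sign = -1.0 if (alternate_sign and h < 0) else 1.0
            row[column] += sign * value  # colliding features accumulate
        rows.append(row)
    return rows


rows = hash_features([{"dog": 1, "cat": 2}, {"dog": 2, "run": 5}])
print(len(rows), len(rows[0]))  # 2 10
```

No vocabulary is stored, which is what keeps memory use independent of the number of distinct feature names.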


>>> from sklearn.feature_extraction import FeatureHasher
>>> h = FeatureHasher(n_features=10)
>>> D = [{'dog': 1, 'cat':2, 'elephant':4},{'dog': 2, 'run': 5}]
>>> f = h.transform(D)
>>> f.toarray()
array([[ 0.,  0., -4., -1.,  0.,  0.,  0.,  0.,  0.,  2.],
       [ 0.,  0.,  0., -2., -5.,  0.,  0.,  0.,  0.,  0.]])

See also

  • DictVectorizer : vectorizes string-valued features using a hash table.

  • sklearn.preprocessing.OneHotEncoder : handles nominal/categorical features.


method fit
val fit :
  ?x:[>`ArrayLike] Np.Obj.t ->
  ?y:Py.Object.t ->
  [> tag] Obj.t ->


This method doesn't do anything. It exists purely for compatibility with the scikit-learn transformer API.


  • X : ndarray


  • self : FeatureHasher


method fit_transform
val fit_transform :
  ?y:[>`ArrayLike] Np.Obj.t ->
  ?fit_params:(string * Py.Object.t) list ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.


  • X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

  • y : ndarray of shape (n_samples,), default=None Target values.

  • **fit_params : dict Additional fit parameters.


  • X_new : ndarray array of shape (n_samples, n_features_new) Transformed array.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.


method transform
val transform :
  raw_X:Py.Object.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform a sequence of instances to a scipy.sparse matrix.


  • raw_X : iterable over iterable over raw features, length = n_samples Samples. Each sample must be an iterable (e.g., a list or tuple) containing/generating feature names (and optionally values, see the input_type constructor argument) which will be hashed. raw_X need not support the len function, so it can be the result of a generator; n_samples is determined on the fly.


  • X : sparse matrix of shape (n_samples, n_features) Feature matrix, for use with estimators or further transformers.


Module Sklearn.Feature_extraction.Image wraps Python module sklearn.feature_extraction.image.


Module Sklearn.Feature_extraction.Image.PatchExtractor wraps Python class sklearn.feature_extraction.image.PatchExtractor.

type t


constructor and attributes create
val create :
  ?patch_size:Py.Object.t ->
  ?max_patches:[`F of float | `I of int] ->
  ?random_state:int ->
  unit ->

Extracts patches from a collection of images.

Read more in the :ref:User Guide <image_feature_extraction>.

.. versionadded:: 0.9


  • patch_size : tuple of int (patch_height, patch_width) The dimensions of one patch.

  • max_patches : int or float, default=None The maximum number of patches per image to extract. If max_patches is a float in (0, 1), it is taken to mean a proportion of the total number of patches.

  • random_state : int, RandomState instance, default=None Determines the random number generator used for random sampling when max_patches is not None. Use an int to make the randomness deterministic. See :term:Glossary <random_state>.


>>> from sklearn.datasets import load_sample_images
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the second image in this dataset:
>>> X = load_sample_images().images[1]
>>> print('Image shape: {}'.format(X.shape))
Image shape: (427, 640, 3)
>>> pe = image.PatchExtractor(patch_size=(2, 2))
>>> pe_fit = pe.fit(X)
>>> pe_trans = pe.transform(X)
>>> print('Patches shape: {}'.format(pe_trans.shape))
Patches shape: (545706, 2, 2)


method fit
val fit :
  ?y:Py.Object.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.


  • X : array-like of shape (n_samples, n_features) Training data.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.


method transform
val transform :
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transforms the image samples in X into a matrix of patch data.


  • X : ndarray of shape (n_samples, image_height, image_width) or (n_samples, image_height, image_width, n_channels) Array of images from which to extract patches. For color images, the last dimension specifies the channel: an RGB image would have n_channels=3.


  • patches : array of shape (n_patches, patch_height, patch_width) or (n_patches, patch_height, patch_width, n_channels) The collection of patches extracted from the images, where n_patches is either n_samples * max_patches or the total number of patches that can be extracted.
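
The value of n_patches follows from simple arithmetic, sketched below. The helper `count_patches` is hypothetical, and the handling of max_patches reflects the parameter description above rather than the library source.

```python
def count_patches(n_samples, image_height, image_width, patch_size, max_patches=None):
    """Number of patches produced for a batch of equally sized images."""
    patch_height, patch_width = patch_size
    # Every top-left corner at which a full patch fits inside the image.
    per_image = (image_height - patch_height + 1) * (image_width - patch_width + 1)
    if max_patches is None:
        return n_samples * per_image
    if isinstance(max_patches, float):
        # A float in (0, 1) is a proportion of the patches of one image.
        return n_samples * int(max_patches * per_image)
    return n_samples * min(max_patches, per_image)


# A (427, 640, 3) array is read as 427 images of height 640 and width 3.
print(count_patches(427, 640, 3, (2, 2)))  # 545706
```

This accounts for the shape (545706, 2, 2) in the PatchExtractor example: a single color image passed as X is interpreted as a stack of 427 narrow grayscale images.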


Module Sklearn.Feature_extraction.Image.Product wraps Python class sklearn.feature_extraction.image.product.

type t


constructor and attributes create
val create :
  ?repeat:Py.Object.t ->
  Py.Object.t list ->

product(*iterables, repeat=1) --> product object

Cartesian product of input iterables. Equivalent to nested for-loops.

For example, product(A, B) returns the same as: ((x,y) for x in A for y in B). The leftmost iterators are in the outermost for-loop, so the output tuples cycle in a manner similar to an odometer (with the rightmost element changing on every iteration).

To compute the product of an iterable with itself, specify the number of repetitions with the optional repeat keyword argument. For example, product(A, repeat=4) means the same as product(A, A, A, A).

product('ab', range(3)) --> ('a',0) ('a',1) ('a',2) ('b',0) ('b',1) ('b',2)
product((0,1), (0,1), (0,1)) --> (0,0,0) (0,0,1) (0,1,0) (0,1,1) (1,0,0) ...
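
The description matches Python's standard `itertools.product`, which the following short check uses to confirm the nested-loop equivalence and the meaning of repeat.

```python
from itertools import product

# Nested for-loops: the rightmost iterable varies fastest.
nested = [(x, y) for x in "ab" for y in range(3)]
assert list(product("ab", range(3))) == nested

# repeat=3 is the same as passing the iterable three times.
assert list(product((0, 1), repeat=3)) == list(product((0, 1), (0, 1), (0, 1)))

print(list(product("ab", range(3)))[:4])  # [('a', 0), ('a', 1), ('a', 2), ('b', 0)]
```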


method iter
val iter :
  [> tag] Obj.t ->
  Dict.t Seq.t

Implement iter(self).


function as_strided
val as_strided :
  ?shape:int list ->
  ?strides:Py.Object.t ->
  ?subok:bool ->
  ?writeable:bool ->
  x:[>`ArrayLike] Np.Obj.t ->
  unit ->
  [>`ArrayLike] Np.Obj.t

Create a view into the array with the given shape and strides.

.. warning:: This function has to be used with extreme care, see notes.


  • x : ndarray Array from which to create a new view.

  • shape : sequence of int, optional The shape of the new array. Defaults to x.shape.

  • strides : sequence of int, optional The strides of the new array. Defaults to x.strides.

  • subok : bool, optional .. versionadded:: 1.10

    If True, subclasses are preserved.

  • writeable : bool, optional .. versionadded:: 1.12

    If set to False, the returned array will always be readonly. Otherwise it will be writable if the original array was. It is advisable to set this to False if possible (see Notes).


  • view : ndarray

See also

  • broadcast_to: broadcast an array to a given shape.

  • reshape : reshape an array.


as_strided creates a view into the array given the exact strides and shape. This means it manipulates the internal data structure of ndarray and, if done incorrectly, the array elements can point to invalid memory and can corrupt results or crash your program. It is advisable to always use the original x.strides when calculating new strides to avoid reliance on a contiguous memory layout.

Furthermore, arrays created with this function often contain self-overlapping memory, so that two elements are identical. Vectorized write operations on such arrays will typically be unpredictable. They may even give different results for small, large, or transposed arrays. Since writing to these arrays has to be tested and done with great care, you may want to use writeable=False to avoid accidental write operations.

For these reasons it is advisable to avoid as_strided when possible.
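
How shape and strides select elements can be sketched over a flat Python list. The helper `strided_view` is hypothetical; it measures strides in elements rather than bytes, assumes non-negative strides, copies the data, and bounds-checks the request, none of which as_strided does.

```python
def strided_view(buffer, shape, strides, offset=0):
    """Materialize a 2-D strided view of a flat list as nested lists."""
    rows, cols = shape
    row_stride, col_stride = strides
    # Index of the last element the view would touch (non-negative strides).
    last = offset + (rows - 1) * row_stride + (cols - 1) * col_stride
    if rows and cols and not 0 <= last < len(buffer):
        raise IndexError("view would reach outside the buffer")
    return [
        [buffer[offset + i * row_stride + j * col_stride] for j in range(cols)]
        for i in range(rows)
    ]


data = [0, 1, 2, 3, 4, 5]
# Sliding windows of length 3: consecutive rows overlap in the buffer.
print(strided_view(data, shape=(4, 3), strides=(1, 1)))
# [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
```

The overlapping rows are the self-overlapping memory the notes warn about: in a real view, writing to one row would change its neighbours.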


function check_array
val check_array :
  ?accept_sparse:[`S of string | `StringList of string list | `Bool of bool] ->
  ?accept_large_sparse:bool ->
  ?dtype:[`Dtypes of Np.Dtype.t list | `S of string | `Dtype of Np.Dtype.t | `None] ->
  ?order:[`F | `C] ->
  ?copy:bool ->
  ?force_all_finite:[`Allow_nan | `Bool of bool] ->
  ?ensure_2d:bool ->
  ?allow_nd:bool ->
  ?ensure_min_samples:int ->
  ?ensure_min_features:int ->
  ?estimator:[>`BaseEstimator] Np.Obj.t ->
  array:Py.Object.t ->
  unit ->

Input validation on an array, list, sparse matrix or similar.

By default, the input is checked to be a non-empty 2D array containing only finite values. If the dtype of the array is object, attempt converting to float, raising on failure.


  • array : object Input object to check / convert.

  • accept_sparse : string, boolean or list/tuple of strings (default=False) String[s] representing allowed sparse matrix formats, such as 'csc', 'csr', etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

  • accept_large_sparse : bool (default=True) If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by accept_sparse, accept_large_sparse=False will cause it to be accepted only if its indices are stored with a 32-bit dtype.

    .. versionadded:: 0.20

  • dtype : string, type, list of types or None (default='numeric') Data type of result. If None, the dtype of the input is preserved. If 'numeric', dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion to the first type is only performed if the dtype of the input is not in the list.

  • order : 'F', 'C' or None (default=None) Whether an array will be forced to be fortran or c-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.

  • copy : boolean (default=False) Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

  • force_all_finite : boolean or 'allow-nan', (default=True) Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:

    • True: Force all values of array to be finite.
    • False: accepts np.inf, np.nan, pd.NA in array.
    • 'allow-nan': accepts only np.nan and pd.NA values in array. Values cannot be infinite.

    .. versionadded:: 0.20 force_all_finite accepts the string 'allow-nan'.

    .. versionchanged:: 0.23 Accepts pd.NA and converts it into np.nan

  • ensure_2d : boolean (default=True) Whether to raise a value error if array is not 2D.

  • allow_nd : boolean (default=False) Whether to allow array.ndim > 2.

  • ensure_min_samples : int (default=1) Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.

  • ensure_min_features : int (default=1) Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

  • estimator : str or estimator instance (default=None) If passed, include the name of the estimator in warning messages.


  • array_converted : object The converted and validated array.


function check_random_state
val check_random_state :
  [`Optional of [`I of int | `None] | `RandomState of Py.Object.t] ->

Turn seed into a np.random.RandomState instance.


  • seed : None | int | instance of RandomState If seed is None, return the RandomState singleton used by np.random. If seed is an int, return a new RandomState instance seeded with seed. If seed is already a RandomState instance, return it. Otherwise raise ValueError.


function extract_patches
val extract_patches :
  ?patch_shape:[`Tuple of Py.Object.t | `I of int] ->
  ?extraction_step:[`Tuple of Py.Object.t | `I of int] ->
  arr:[>`ArrayLike] Np.Obj.t ->
  unit ->

DEPRECATED: The function feature_extraction.image.extract_patches has been deprecated in 0.22 and will be removed in 0.24.

Extracts patches of any n-dimensional array in place using strides.

Given an n-dimensional array it will return a 2n-dimensional array with the first n dimensions indexing patch position and the last n indexing the patch content. This operation is immediate (O(1)). A reshape performed on the first n dimensions will cause numpy to copy data, leading to a list of extracted patches.

Read more in the :ref:User Guide <image_feature_extraction>.


  • arr : ndarray n-dimensional array of which patches are to be extracted

  • patch_shape : int or tuple of length arr.ndim, default=8 Indicates the shape of the patches to be extracted. If an integer is given, the shape will be a hypercube with that side length.

  • extraction_step : int or tuple of length arr.ndim, default=1 Indicates step size at which extraction shall be performed. If integer is given, then the step is uniform in all dimensions.


  • patches : strided ndarray 2n-dimensional array indexing patches on first n dimensions and containing patches on the last n dimensions. These dimensions are fake, but this way no data is copied. A simple reshape invokes a copying operation to obtain a list of patches: result.reshape([-1] + list(patch_shape))
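
The shape of the returned 2n-dimensional array can be worked out without touching any data. The helper `patches_shape` below is a hypothetical sketch that assumes each patch fits inside the array.

```python
def patches_shape(arr_shape, patch_shape=8, extraction_step=1):
    """Shape of the 2n-dimensional array of patches for an n-dimensional input."""
    ndim = len(arr_shape)
    # An integer is broadcast to every dimension.
    if isinstance(patch_shape, int):
        patch_shape = (patch_shape,) * ndim
    if isinstance(extraction_step, int):
        extraction_step = (extraction_step,) * ndim
    # Number of patch positions along each axis.
    positions = tuple(
        (size - patch) // step + 1
        for size, patch, step in zip(arr_shape, patch_shape, extraction_step)
    )
    return positions + tuple(patch_shape)


print(patches_shape((10, 10), patch_shape=3, extraction_step=1))       # (8, 8, 3, 3)
print(patches_shape((10, 10), patch_shape=(4, 2), extraction_step=2))  # (4, 5, 4, 2)
```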


function extract_patches_2d
val extract_patches_2d :
  ?max_patches:[`F of float | `I of int] ->
  ?random_state:int ->
  image:[>`ArrayLike] Np.Obj.t ->
  patch_size:Py.Object.t ->
  unit ->
  [>`ArrayLike] Np.Obj.t

Reshape a 2D image into a collection of patches.

The resulting patches are allocated in a dedicated array.

Read more in the :ref:User Guide <image_feature_extraction>.


  • image : ndarray of shape (image_height, image_width) or (image_height, image_width, n_channels) The original image data. For color images, the last dimension specifies the channel: an RGB image would have n_channels=3.

  • patch_size : tuple of int (patch_height, patch_width) The dimensions of one patch.

  • max_patches : int or float, default=None The maximum number of patches to extract. If max_patches is a float between 0 and 1, it is taken to be a proportion of the total number of patches.

  • random_state : int, RandomState instance, default=None Determines the random number generator used for random sampling when max_patches is not None. Use an int to make the randomness deterministic. See :term:Glossary <random_state>.


  • patches : array of shape (n_patches, patch_height, patch_width) or (n_patches, patch_height, patch_width, n_channels) The collection of patches extracted from the image, where n_patches is either max_patches or the total number of patches that can be extracted.


>>> from sklearn.datasets import load_sample_image
>>> from sklearn.feature_extraction import image
>>> # Use the array data from the first image in this dataset:
>>> one_image = load_sample_image('china.jpg')
>>> print('Image shape: {}'.format(one_image.shape))
Image shape: (427, 640, 3)
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> print('Patches shape: {}'.format(patches.shape))
Patches shape: (272214, 2, 2, 3)
>>> # Here are just two of these patches:
>>> print(patches[1])
[[[174 201 231]
  [174 201 231]]
 [[173 200 230]
  [173 200 230]]]
>>> print(patches[800])
[[[187 214 243]
  [188 215 244]]
 [[187 214 243]
  [188 215 244]]]
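
On a tiny grayscale image the extraction is easy to follow by hand. The sketch below uses nested lists and a hypothetical helper `extract_patches_2d_sketch`; it covers only the case where max_patches is None.

```python
def extract_patches_2d_sketch(image, patch_size):
    """Return every patch of a 2-D nested-list image, row-major by top-left corner."""
    patch_height, patch_width = patch_size
    height, width = len(image), len(image[0])
    patches = []
    for top in range(height - patch_height + 1):
        for left in range(width - patch_width + 1):
            patches.append(
                [row[left:left + patch_width] for row in image[top:top + patch_height]]
            )
    return patches


image = [[0, 1, 2],
         [3, 4, 5],
         [6, 7, 8]]
patches = extract_patches_2d_sketch(image, (2, 2))
print(len(patches))  # 4
print(patches[1])    # [[1, 2], [4, 5]]
# The same count formula gives the number of patches in the example above.
print((427 - 2 + 1) * (640 - 2 + 1))  # 272214
```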


function grid_to_graph
val grid_to_graph :
  ?n_z:int ->
  ?mask:[`Arr of [>`ArrayLike] Np.Obj.t | `Dtype_bool of Py.Object.t] ->
  ?return_as:Py.Object.t ->
  ?dtype:Np.Dtype.t ->
  n_x:int ->
  n_y:int ->
  unit ->

Graph of the pixel-to-pixel connections.

Edges exist if 2 voxels are connected.


  • n_x : int Dimension in x axis

  • n_y : int Dimension in y axis

  • n_z : int, default=1 Dimension in z axis

  • mask : ndarray of shape (n_x, n_y, n_z), dtype=bool, default=None An optional mask of the image, to consider only part of the pixels.

  • return_as : np.ndarray or a sparse matrix class, default=sparse.coo_matrix The class to use to build the returned adjacency matrix.

  • dtype : dtype, default=int The data type of the returned sparse matrix. By default it is int.


For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix instance. Going forward, np.ndarray returns an np.ndarray, as expected.

For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
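
The connectivity encoded by the graph can be sketched by listing edges directly. The helper `grid_edges` is hypothetical and assumes that "connected" means adjacent along one axis; the real function returns an adjacency matrix, while this sketch lists only the edges between distinct voxels.

```python
def grid_edges(n_x, n_y, n_z=1):
    """Undirected edges between axis-adjacent voxels, as pairs of flat indices."""
    def flat(x, y, z):
        # Row-major (C-order) flattening of the (n_x, n_y, n_z) grid.
        return (x * n_y + y) * n_z + z

    edges = []
    for x in range(n_x):
        for y in range(n_y):
            for z in range(n_z):
                if x + 1 < n_x:
                    edges.append((flat(x, y, z), flat(x + 1, y, z)))
                if y + 1 < n_y:
                    edges.append((flat(x, y, z), flat(x, y + 1, z)))
                if z + 1 < n_z:
                    edges.append((flat(x, y, z), flat(x, y, z + 1)))
    return edges


print(sorted(grid_edges(2, 2)))  # [(0, 1), (0, 2), (1, 3), (2, 3)]
```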


function img_to_graph
val img_to_graph :
  ?mask:[`Arr of [>`ArrayLike] Np.Obj.t | `Dtype_bool of Py.Object.t] ->
  ?return_as:Py.Object.t ->
  ?dtype:Np.Dtype.t ->
  img:[>`ArrayLike] Np.Obj.t ->
  unit ->

Graph of the pixel-to-pixel gradient connections

Edges are weighted with the gradient values.

Read more in the :ref:User Guide <image_feature_extraction>.


  • img : ndarray of shape (height, width) or (height, width, channel) 2D or 3D image.

  • mask : ndarray of shape (height, width) or (height, width, channel), dtype=bool, default=None An optional mask of the image, to consider only part of the pixels.

  • return_as : np.ndarray or a sparse matrix class, default=sparse.coo_matrix The class to use to build the returned adjacency matrix.

  • dtype : dtype, default=None The data type of the returned sparse matrix. By default it is the dtype of img.


For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix instance. Going forward, np.ndarray returns an np.ndarray, as expected.

For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
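
Gradient-weighted edges can be sketched for a small 2D image. The helper `gradient_edges` is hypothetical and assumes the gradient is the absolute difference between adjacent pixel values; it returns a dictionary of edges rather than an adjacency matrix.

```python
def gradient_edges(img):
    """Edges between adjacent pixels of a 2-D image, weighted by |difference|."""
    height, width = len(img), len(img[0])
    edges = {}
    for r in range(height):
        for c in range(width):
            here = r * width + c  # flat index of pixel (r, c)
            if c + 1 < width:
                edges[(here, here + 1)] = abs(img[r][c + 1] - img[r][c])
            if r + 1 < height:
                edges[(here, here + width)] = abs(img[r + 1][c] - img[r][c])
    return edges


img = [[1, 4],
       [2, 8]]
print(gradient_edges(img))
# {(0, 1): 3, (0, 2): 1, (1, 3): 4, (2, 3): 6}
```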


function reconstruct_from_patches_2d
val reconstruct_from_patches_2d :
  patches:[>`ArrayLike] Np.Obj.t ->
  image_size:Py.Object.t ->
  unit ->

Reconstruct the image from all of its patches.

Patches are assumed to overlap and the image is constructed by filling in the patches from left to right, top to bottom, averaging the overlapping regions.

Read more in the :ref:User Guide <image_feature_extraction>.


  • patches : ndarray of shape (n_patches, patch_height, patch_width) or (n_patches, patch_height, patch_width, n_channels) The complete set of patches. If the patches contain colour information, channels are indexed along the last dimension: RGB patches would have n_channels=3.

  • image_size : tuple of int (image_height, image_width) or (image_height, image_width, n_channels) The size of the image that will be reconstructed.


  • image : ndarray of shape image_size The reconstructed image.
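
The averaging of overlapping regions can be sketched for a grayscale image. The helper `reconstruct_2d` is hypothetical and assumes the patches are complete and ordered row-major by their top-left corner.

```python
def reconstruct_2d(patches, image_size):
    """Rebuild a 2-D image by averaging overlapping patches."""
    height, width = image_size
    patch_height, patch_width = len(patches[0]), len(patches[0][0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    n_cols = width - patch_width + 1  # patch positions per image row
    for index, patch in enumerate(patches):
        top, left = divmod(index, n_cols)
        for i in range(patch_height):
            for j in range(patch_width):
                total[top + i][left + j] += patch[i][j]
                count[top + i][left + j] += 1
    # Each pixel is the mean of all patch values that cover it.
    return [[total[r][c] / count[r][c] for c in range(width)] for r in range(height)]


# The four 2x2 patches of [[0, 1, 2], [3, 4, 5], [6, 7, 8]].
patches = [
    [[0, 1], [3, 4]], [[1, 2], [4, 5]],
    [[3, 4], [6, 7]], [[4, 5], [7, 8]],
]
print(reconstruct_2d(patches, (3, 3)))
# [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]]
```

Because the patches here are unmodified, averaging recovers the original image exactly.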


Module Sklearn.Feature_extraction.Text wraps Python module sklearn.feature_extraction.text.


Module Sklearn.Feature_extraction.Text.CountVectorizer wraps Python class sklearn.feature_extraction.text.CountVectorizer.

type t


constructor and attributes create
val create :
  ?input:[`Filename | `File | `Content] ->
  ?encoding:string ->
  ?decode_error:[`Strict | `Ignore | `Replace] ->
  ?strip_accents:[`Ascii | `Unicode] ->
  ?lowercase:bool ->
  ?preprocessor:Py.Object.t ->
  ?tokenizer:Py.Object.t ->
  ?stop_words:[`Arr of [>`ArrayLike] Np.Obj.t | `English] ->
  ?token_pattern:string ->
  ?ngram_range:Py.Object.t ->
  ?analyzer:[`Callable of Py.Object.t | `S of string | `Char | `PyObject of Py.Object.t] ->
  ?max_df:[`F of float | `I of int] ->
  ?min_df:[`F of float | `I of int] ->
  ?max_features:int ->
  ?vocabulary:[`Arr of [>`ArrayLike] Np.Obj.t | `Mapping of Py.Object.t] ->
  ?binary:bool ->
  ?dtype:Np.Dtype.t ->
  unit ->

Convert a collection of text documents to a matrix of token counts.

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

If you do not provide an a-priori dictionary and you do not use an analyzer that does some kind of feature selection then the number of features will be equal to the vocabulary size found by analyzing the data.

Read more in the :ref:User Guide <text_feature_extraction>.
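
How a vocabulary and a count matrix arise from raw text can be sketched in plain Python. The helper `count_tokens` is hypothetical, returns dense rows rather than a sparse matrix, and uses a token regular expression assumed to match the default word pattern (two or more word characters).

```python
import re
from collections import Counter

# Assumption: tokens are runs of 2+ word characters, as in the default pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(documents, ngram_range=(1, 1)):
    """Toy dense bag-of-words: returns (sorted vocabulary, rows of counts)."""
    min_n, max_n = ngram_range
    counts_per_doc = []
    for document in documents:
        tokens = TOKEN_PATTERN.findall(document.lower())
        # Word n-grams for every n with min_n <= n <= max_n.
        ngrams = [
            " ".join(tokens[i:i + n])
            for n in range(min_n, max_n + 1)
            for i in range(len(tokens) - n + 1)
        ]
        counts_per_doc.append(Counter(ngrams))
    vocabulary = sorted(set().union(*counts_per_doc))
    rows = [[counts[term] for term in vocabulary] for counts in counts_per_doc]
    return vocabulary, rows


vocabulary, rows = count_tokens(["The cat sat.", "The cat saw a cat."])
print(vocabulary)  # ['cat', 'sat', 'saw', 'the']
print(rows)        # [[1, 1, 0, 1], [2, 0, 1, 1]]
```

The number of columns equals the number of distinct terms found in the data, which is why the feature count is not known before fitting unless a vocabulary is supplied.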


  • input : string {'filename', 'file', 'content'}, default='content' If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

    If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

    Otherwise the input is expected to be a sequence of items that can be of type string or byte.

  • encoding : string, default='utf-8' If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error : {'strict', 'ignore', 'replace'}, default='strict' Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.

  • strip_accents : {'ascii', 'unicode'}, default=None Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.

    Both 'ascii' and 'unicode' use NFKD normalization from :func:unicodedata.normalize.

  • lowercase : bool, default=True Convert all characters to lowercase before tokenizing.

  • preprocessor : callable, default=None Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.

  • tokenizer : callable, default=None Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

  • stop_words : string {'english'}, list, default=None If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see :ref:stop_words).

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

  • token_pattern : string Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

  • ngram_range : tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

  • analyzer : string, {'word', 'char', 'char_wb'} or callable, default='word' Whether the feature should be made of word n-gram or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

    .. versionchanged:: 0.21

    Since v0.21, if input is filename or file, the data is first read from the file and then passed to the given callable analyzer.

  • max_df : float in range [0.0, 1.0] or int, default=1.0 When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if int, an absolute count. This parameter is ignored if vocabulary is not None.

  • min_df : float in range [0.0, 1.0] or int, default=1 When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if int, an absolute count. This parameter is ignored if vocabulary is not None.

  • max_features : int, default=None If not None, build a vocabulary that only considers the top max_features ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

  • vocabulary : Mapping or iterable, default=None Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents. Indices in the mapping should not be repeated and should not have any gap between 0 and the largest index.

  • binary : bool, default=False If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

  • dtype : type, default=np.int64 Type of the matrix returned by fit_transform() or transform().
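The constraint on a user-supplied vocabulary mapping (no repeated indices, no gap between 0 and the largest index) can be checked before the mapping is handed to the constructor. The following is a minimal pure-Python sketch; check_vocabulary is a hypothetical helper written for illustration, not a scikit-learn function.

```python
def check_vocabulary(vocabulary):
    """Raise ValueError if a term -> index mapping has repeated indices or gaps."""
    indices = set(vocabulary.values())
    if len(indices) != len(vocabulary):
        raise ValueError("vocabulary contains repeated indices")
    for i in range(len(vocabulary)):
        if i not in indices:
            raise ValueError(
                f"vocabulary of size {len(vocabulary)} has no index {i}"
            )
    return vocabulary


# A valid mapping: indices 0, 1, 2 with no repeats and no gaps.
good = check_vocabulary({"ham": 0, "spam": 1, "eggs": 2})

# An invalid mapping: index 1 is missing.
try:
    check_vocabulary({"ham": 0, "spam": 2})
    gap_rejected = False
except ValueError:
    gap_rejected = True

# An invalid mapping: index 0 is used twice.
try:
    check_vocabulary({"ham": 0, "spam": 0})
    repeat_rejected = False
except ValueError:
    repeat_rejected = True
```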


  • vocabulary_ : dict A mapping of terms to feature indices.

  • fixed_vocabulary_ : bool True if a fixed vocabulary of term-to-index mappings was provided by the user.

  • stop_words_ : set Terms that were ignored because they either:

    • occurred in too many documents (max_df)
    • occurred in too few documents (min_df)
    • were cut off by feature selection (max_features).

    This is only available if no vocabulary was given.
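How min_df and max_df decide which terms are kept and which end up in stop_words_ can be sketched in pure Python. This is an illustration of the thresholds as described above, not the library's implementation; tokenize and split_by_document_frequency are hypothetical helpers, and max_features is left out for brevity.

```python
import re
from collections import Counter

corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(doc):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())


def split_by_document_frequency(docs, min_df=1, max_df=1.0):
    """Return (kept_terms, ignored_terms) under the min_df / max_df thresholds."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    # A float threshold is a proportion of documents, an int an absolute count.
    low = min_df * n_docs if isinstance(min_df, float) else min_df
    high = max_df * n_docs if isinstance(max_df, float) else max_df
    kept = {term for term, count in df.items() if low <= count <= high}
    return kept, set(df) - kept


kept, ignored = split_by_document_frequency(corpus, min_df=2, max_df=0.75)
# kept    -> {'first', 'document'}
# ignored -> {'this', 'is', 'the', 'second', 'and', 'third', 'one'}
```

With four documents, max_df=0.75 corresponds to a document count of 3, so 'this', 'is' and 'the' (present in all four) are dropped as too frequent, while min_df=2 drops the terms that occur in a single document.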


>>> from sklearn.feature_extraction.text import CountVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = CountVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
>>> vectorizer2 = CountVectorizer(analyzer='word', ngram_range=(2, 2))
>>> X2 = vectorizer2.fit_transform(corpus)
>>> print(vectorizer2.get_feature_names())
['and this', 'document is', 'first document', 'is the', 'is this',
 'second document', 'the first', 'the second', 'the third', 'third one',
 'this document', 'this is', 'this the']
>>> print(X2.toarray())
[[0 0 1 1 0 0 1 0 0 0 0 1 0]
 [0 1 0 1 0 1 0 1 0 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 1 0 1 0]
 [0 0 1 0 1 0 1 0 0 0 0 0 1]]
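
The bigram feature names in the example come from sliding a window over the tokens of each document. The following pure-Python sketch shows how ngram_range selects the window sizes; word_ngrams is a hypothetical helper, and the regular expression is an assumption that mirrors the documented default of "2 or more alphanumeric characters".

```python
import re


def word_ngrams(doc, ngram_range=(1, 1)):
    """Lowercase, tokenize, and emit space-joined word n-grams."""
    tokens = re.findall(r"\b\w\w+\b", doc.lower())
    min_n, max_n = ngram_range
    ngrams = []
    # Every n with min_n <= n <= max_n contributes its own set of windows.
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


unigrams = word_ngrams("This is the first document.")
# -> ['this', 'is', 'the', 'first', 'document']
bigrams = word_ngrams("This is the first document.", ngram_range=(2, 2))
# -> ['this is', 'is the', 'the first', 'first document']
both = word_ngrams("And this is the third one.", ngram_range=(1, 2))
```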

See Also

HashingVectorizer, TfidfVectorizer


The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.


method build_analyzer
val build_analyzer :
  [> tag] Obj.t ->

Return a callable that handles preprocessing, tokenization and n-grams generation.


  • analyzer: callable A function to handle preprocessing, tokenization and n-grams generation.


method build_preprocessor
val build_preprocessor :
  [> tag] Obj.t ->

Return a function to preprocess the text before tokenization.


  • preprocessor: callable A function to preprocess the text before tokenization.


method build_tokenizer
val build_tokenizer :
  [> tag] Obj.t ->

Return a function that splits a string into a sequence of tokens.


  • tokenizer: callable A function to split a string into a sequence of tokens.


method decode
val decode :
  doc:string ->
  [> tag] Obj.t ->

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.


  • doc : str The string to decode.


  • doc: str A string of unicode symbols.
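
The effect of the encoding and decode_error constructor parameters on decoding can be illustrated with Python's built-in bytes.decode. This is a simplified sketch that only covers in-memory input; decode_document is a hypothetical helper, not the wrapped method.

```python
def decode_document(doc, encoding="utf-8", decode_error="strict"):
    """Decode bytes to str; pass str through unchanged."""
    if isinstance(doc, bytes):
        return doc.decode(encoding, decode_error)
    return doc


raw = "café".encode("utf-8")  # b'caf\xc3\xa9'

as_utf8 = decode_document(raw)
# 'ignore' silently drops the bytes that are not valid ASCII.
ignored = decode_document(raw, encoding="ascii", decode_error="ignore")
# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = decode_document(raw, encoding="ascii", decode_error="replace")

# 'strict' (the default) raises instead of guessing.
try:
    decode_document(raw, encoding="ascii")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
```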


method fit
val fit :
  ?y:Py.Object.t ->
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Learn a vocabulary dictionary of all tokens in the raw documents.


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.




method fit_transform
val fit_transform :
  ?y:Py.Object.t ->
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Learn the vocabulary dictionary and return document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.


  • X : array of shape (n_samples, n_features) Document-term matrix.


method get_feature_names
val get_feature_names :
  [> tag] Obj.t ->
  string list

Array mapping from feature integer indices to feature name.


  • feature_names : list A list of feature names.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method get_stop_words
val get_stop_words :
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t option

Build or fetch the effective stop words list.


  • stop_words: list or None A list of stop words.


method inverse_transform
val inverse_transform :
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Return terms per document with nonzero entries in X.


  • X : {array-like, sparse matrix} of shape (n_samples, n_features) Document-term matrix.


  • X_inv : list of arrays of shape (n_samples,) List of arrays of terms.
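
Conceptually, inverse_transform looks up the feature name of every nonzero column in each row. A minimal pure-Python sketch, using plain lists in place of arrays (inverse_transform_rows is a hypothetical helper):

```python
def inverse_transform_rows(X, feature_names):
    """For each row, list the terms whose count is nonzero."""
    return [
        [feature_names[j] for j, count in enumerate(row) if count != 0]
        for row in X
    ]


feature_names = ["and", "document", "first", "is", "one",
                 "second", "the", "third", "this"]
X = [
    [0, 1, 1, 1, 0, 0, 1, 0, 1],
    [0, 2, 0, 1, 0, 1, 1, 0, 1],
]
terms = inverse_transform_rows(X, feature_names)
# terms[0] -> ['document', 'first', 'is', 'the', 'this']
# terms[1] -> ['document', 'is', 'second', 'the', 'this']
```

Note that the original word order and the number of occurrences are not recovered: only the set of terms present in each document is.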


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.
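
The <component>__<parameter> convention can be illustrated with a toy container that routes each key to the named component. Both classes below are hypothetical stand-ins written for this sketch; they are not scikit-learn classes and omit most of the real validation.

```python
class Component:
    """Hypothetical leaf estimator with two parameters."""

    def __init__(self, lowercase=True, binary=False):
        self.lowercase = lowercase
        self.binary = binary

    def set_params(self, **params):
        for name, value in params.items():
            if not hasattr(self, name):
                raise ValueError(f"invalid parameter {name!r}")
            setattr(self, name, value)
        return self


class Composite:
    """Hypothetical container holding named components."""

    def __init__(self, **components):
        self.components = components

    def set_params(self, **params):
        for key, value in params.items():
            # 'count__binary' -> component 'count', parameter 'binary'
            name, delim, sub_key = key.partition("__")
            if not delim or name not in self.components:
                raise ValueError(f"invalid parameter {key!r}")
            self.components[name].set_params(**{sub_key: value})
        return self


pipe = Composite(count=Component())
result = pipe.set_params(count__binary=True, count__lowercase=False)
```

Returning self from set_params is what allows calls to be chained.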


method transform
val transform :
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform documents to document-term matrix.

Extract token counts out of raw text documents using the vocabulary fitted with fit or the one provided to the constructor.


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.


  • X : sparse matrix of shape (n_samples, n_features) Document-term matrix.


attribute vocabulary_
val vocabulary_ : t -> Dict.t
val vocabulary_opt : t -> (Dict.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute fixed_vocabulary_
val fixed_vocabulary_ : t -> bool
val fixed_vocabulary_opt : t -> (bool) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute stop_words_
val stop_words_ : t -> Py.Object.t
val stop_words_opt : t -> (Py.Object.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


Module Sklearn.Feature_extraction.Text.HashingVectorizer wraps Python class sklearn.feature_extraction.text.HashingVectorizer.

type t


constructor and attributes create
val create :
  ?input:[`Filename | `File | `Content] ->
  ?encoding:string ->
  ?decode_error:[`Strict | `Ignore | `Replace] ->
  ?strip_accents:[`Ascii | `Unicode] ->
  ?lowercase:bool ->
  ?preprocessor:Py.Object.t ->
  ?tokenizer:Py.Object.t ->
  ?stop_words:[`Arr of [>`ArrayLike] Np.Obj.t | `English] ->
  ?token_pattern:string ->
  ?ngram_range:Py.Object.t ->
  ?analyzer:[`Callable of Py.Object.t | `S of string | `Char | `PyObject of Py.Object.t] ->
  ?n_features:int ->
  ?binary:bool ->
  ?norm:[`L1 | `L2] ->
  ?alternate_sign:bool ->
  ?dtype:Np.Dtype.t ->
  unit ->

Convert a collection of text documents to a matrix of token occurrences

It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the Euclidean unit sphere if norm='l2'.

This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.

This strategy has several advantages:

  • it uses very little memory and scales to large datasets, as there is no need to store a vocabulary dictionary in memory

  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters

  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.

  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).

  • no IDF weighting as this would render the transformer stateful.

The hash function employed is the signed 32-bit version of Murmurhash3.
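
The hashing trick itself fits in a few lines. The sketch below is illustrative only: it substitutes CRC32 from the standard library for MurmurHash3, and the index-by-modulo and sign-from-hash scheme is an assumption about how a signed 32-bit hash is typically used, so the resulting columns will not match the library's output. signed_hash32 and hash_vectorize are hypothetical helpers.

```python
import re
import zlib


def signed_hash32(token):
    """Stand-in signed 32-bit hash (CRC32, NOT MurmurHash3)."""
    h = zlib.crc32(token.encode("utf-8"))
    return h - 2**32 if h >= 2**31 else h


def hash_vectorize(doc, n_features=16, alternate_sign=True):
    """Map a document to a fixed-length vector without storing a vocabulary."""
    vector = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        h = signed_hash32(token)
        index = abs(h) % n_features  # column for this token
        # With alternate_sign, the sign of the hash decides the sign of the count.
        sign = -1.0 if (alternate_sign and h < 0) else 1.0
        vector[index] += sign
    return vector


doc = "This document is the second document."
row = hash_vectorize(doc)
unsigned_row = hash_vectorize(doc, alternate_sign=False)
```

Because the column index depends only on the token and n_features, any document can be transformed independently of all others, which is why no fitting step is required. With a small n_features such as 16, distinct tokens may share a column.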

Read more in the :ref:User Guide <text_feature_extraction>.


  • input : string {'filename', 'file', 'content'}, default='content' If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

    If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

    Otherwise the input is expected to be a sequence of items that can be of type string or byte.

  • encoding : string, default='utf-8' If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error : {'strict', 'ignore', 'replace'}, default='strict' Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.

  • strip_accents : {'ascii', 'unicode'}, default=None Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.

    Both 'ascii' and 'unicode' use NFKD normalization from :func:unicodedata.normalize.

  • lowercase : bool, default=True Convert all characters to lowercase before tokenizing.

  • preprocessor : callable, default=None Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.

  • tokenizer : callable, default=None Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

  • stop_words : string {'english'}, list, default=None If 'english', a built-in stop word list for English is used. There are several known issues with 'english' and you should consider an alternative (see :ref:stop_words).

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

  • token_pattern : string Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

  • ngram_range : tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

  • analyzer : string, {'word', 'char', 'char_wb'} or callable, default='word' Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

    .. versionchanged:: 0.21

    Since v0.21, if input is filename or file, the data is first read from the file and then passed to the given callable analyzer.

  • n_features : int, default=(2 ** 20) The number of features (columns) in the output matrices. Small numbers of features are likely to cause hash collisions, but large numbers will cause larger coefficient dimensions in linear learners.

  • binary : bool, default=False If True, all non zero counts are set to 1. This is useful for discrete probabilistic models that model binary events rather than integer counts.

  • norm : {'l1', 'l2'}, default='l2' Norm used to normalize term vectors. None for no normalization.

  • alternate_sign : bool, default=True When True, an alternating sign is added to the features so as to approximately conserve the inner product in the hashed space even for small n_features. This approach is similar to sparse random projection.

    .. versionadded:: 0.19

  • dtype : type, default=np.float64 Type of the matrix returned by fit_transform() or transform().
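
The difference between the 'ascii' and 'unicode' settings of strip_accents can be seen with the standard unicodedata module. Both functions below are illustrative sketches of NFKD-based accent stripping under the description given above; their names are chosen for this example.

```python
import unicodedata


def strip_accents_unicode(s):
    """Decompose with NFKD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def strip_accents_ascii(s):
    """Decompose with NFKD, then drop anything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ascii", "ignore").decode("ascii")


unicode_result = strip_accents_unicode("crème brûlée")  # -> 'creme brulee'
ascii_result = strip_accents_ascii("crème brûlée")      # -> 'creme brulee'

# Characters without an ASCII mapping survive 'unicode' but vanish under 'ascii'.
unicode_non_latin = strip_accents_unicode("日本")  # -> '日本'
ascii_non_latin = strip_accents_ascii("日本")      # -> ''
```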


>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = HashingVectorizer(n_features=2**4)
>>> X = vectorizer.fit_transform(corpus)
>>> print(X.shape)
(4, 16)

See Also

CountVectorizer, TfidfVectorizer


method build_analyzer
val build_analyzer :
  [> tag] Obj.t ->

Return a callable that handles preprocessing, tokenization and n-grams generation.


  • analyzer: callable A function to handle preprocessing, tokenization and n-grams generation.


method build_preprocessor
val build_preprocessor :
  [> tag] Obj.t ->

Return a function to preprocess the text before tokenization.


  • preprocessor: callable A function to preprocess the text before tokenization.


method build_tokenizer
val build_tokenizer :
  [> tag] Obj.t ->

Return a function that splits a string into a sequence of tokens.


  • tokenizer: callable A function to split a string into a sequence of tokens.


method decode
val decode :
  doc:string ->
  [> tag] Obj.t ->

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.


  • doc : str The string to decode.


  • doc: str A string of unicode symbols.


method fit
val fit :
  ?y:Py.Object.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Does nothing: this transformer is stateless.


  • X : ndarray of shape [n_samples, n_features] Training data.


method fit_transform
val fit_transform :
  ?y:Py.Object.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform a sequence of documents to a document-term matrix.


  • X : iterable over raw text documents, length = n_samples Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.

  • y : any Ignored. This parameter exists only for compatibility with sklearn.pipeline.Pipeline.


  • X : sparse matrix of shape (n_samples, n_features) Document-term matrix.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method get_stop_words
val get_stop_words :
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t option

Build or fetch the effective stop words list.


  • stop_words: list or None A list of stop words.


method partial_fit
val partial_fit :
  ?y:Py.Object.t ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Does nothing: this transformer is stateless.

This method is just there to mark the fact that this transformer can work in a streaming setup.


  • X : ndarray of shape [n_samples, n_features] Training data.


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.


method transform
val transform :
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform a sequence of documents to a document-term matrix.


  • X : iterable over raw text documents, length = n_samples Samples. Each sample must be a text document (either bytes or unicode strings, file name or file object depending on the constructor argument) which will be tokenized and hashed.


  • X : sparse matrix of shape (n_samples, n_features) Document-term matrix.


Module Sklearn.Feature_extraction.Text.Mapping wraps Python class sklearn.feature_extraction.text.Mapping.

type t


method get_item
val get_item :
  key:Py.Object.t ->
  [> tag] Obj.t ->


method iter
val iter :
  [> tag] Obj.t ->
  Dict.t Seq.t


method get
val get :
  ?default:Py.Object.t ->
  key:Py.Object.t ->
  [> tag] Obj.t ->

D.get(k[,d]) -> D[k] if k in D, else d. d defaults to None.


method items
val items :
  [> tag] Obj.t ->

D.items() -> a set-like object providing a view on D's items


method keys
val keys :
  [> tag] Obj.t ->

D.keys() -> a set-like object providing a view on D's keys


method values
val values :
  [> tag] Obj.t ->

D.values() -> an object providing a view on D's values


Module Sklearn.Feature_extraction.Text.TfidfTransformer wraps Python class sklearn.feature_extraction.text.TfidfTransformer.

type t


constructor and attributes create
val create :
  ?norm:[`L1 | `L2] ->
  ?use_idf:bool ->
  ?smooth_idf:bool ->
  ?sublinear_tf:bool ->
  unit ->

Transform a count matrix to a normalized tf or tf-idf representation

Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval, that has also found good use in document classification.

The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.

The formula that is used to compute the tf-idf for a term t of a document d in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is computed as idf(t) = log [ n / df(t) ] + 1 (if smooth_idf=False), where n is the total number of documents in the document set and df(t) is the document frequency of t; the document frequency is the number of documents in the document set that contain the term t. The effect of adding '1' to the idf in the equation above is that terms with zero idf, i.e., terms that occur in all documents in a training set, will not be entirely ignored. (Note that the idf formula above differs from the standard textbook notation that defines the idf as idf(t) = log [ n / (df(t) + 1) ]).

If smooth_idf=True (the default), the constant '1' is added to the numerator and denominator of the idf as if an extra document was seen containing every term in the collection exactly once, which prevents zero divisions: idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1.
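
Both idf variants can be evaluated directly from the formulas above. The sketch below is a plain-Python transcription (the function name idf is chosen for this example); with n = 4 documents and document frequencies 4, 3, 2 and 1, the smoothed variant gives the distinct idf_ values shown in the TfidfTransformer example in this section.

```python
import math


def idf(n_documents, document_frequency, smooth_idf=True):
    """Inverse document frequency, following the two formulas above."""
    if smooth_idf:
        return math.log((1 + n_documents) / (1 + document_frequency)) + 1
    return math.log(n_documents / document_frequency) + 1


n = 4  # documents in the corpus

smoothed = [round(idf(n, df), 8) for df in (4, 3, 2, 1)]
# -> [1.0, 1.22314355, 1.51082562, 1.91629073]

unsmoothed = [round(idf(n, df, smooth_idf=False), 8) for df in (4, 3, 2, 1)]
```

A term that occurs in every document gets an idf of exactly 1 under both variants, so it is down-weighted relative to rarer terms but not discarded.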

Furthermore, the formulas used to compute tf and idf depend on parameter settings that correspond to the SMART notation used in IR as follows:

Tf is 'n' (natural) by default, 'l' (logarithmic) when sublinear_tf=True. Idf is 't' when use_idf is given, 'n' (none) otherwise. Normalization is 'c' (cosine) when norm='l2', 'n' (none) when norm=None.

Read more in the :ref:User Guide <text_feature_extraction>.


  • norm : {'l1', 'l2'}, default='l2' Each output row will have unit norm, either:

    • 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
    • 'l1': Sum of absolute values of vector elements is 1.
    See :func:preprocessing.normalize.

  • use_idf : bool, default=True Enable inverse-document-frequency reweighting.

  • smooth_idf : bool, default=True Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

  • sublinear_tf : bool, default=False Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
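
How use_idf, sublinear_tf and norm combine for a single row of counts can be sketched in pure Python. This is an illustration under the parameter descriptions above, not the library's sparse implementation; tfidf_row is a hypothetical helper, and the idf weights are the rounded values from the example in this section.

```python
import math


def tfidf_row(counts, idf_weights, sublinear_tf=False, norm="l2"):
    """Weight one row of term counts by idf, then normalize it."""
    row = []
    for tf, weight in zip(counts, idf_weights):
        if sublinear_tf and tf > 0:
            tf = 1 + math.log(tf)  # dampen repeated occurrences
        row.append(tf * weight)
    if norm == "l2":
        scale = math.sqrt(sum(v * v for v in row))
    elif norm == "l1":
        scale = sum(abs(v) for v in row)
    else:
        scale = 1.0  # norm=None: leave the row unnormalized
    return [v / scale for v in row] if scale else row


idf_weights = [1.0, 1.22314355, 1.51082562, 1.0,
               1.91629073, 1.0, 1.91629073, 1.91629073]
counts = [1, 2, 0, 1, 1, 1, 0, 0]  # one row of term counts

l2_row = tfidf_row(counts, idf_weights)
l1_row = tfidf_row(counts, idf_weights, norm="l1")
damped = tfidf_row(counts, idf_weights, sublinear_tf=True, norm=None)
```

After 'l2' normalization the squares of the row sum to 1, and after 'l1' the absolute values sum to 1. With sublinear_tf, a count of 2 contributes 1 + log(2) rather than 2 before idf weighting.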


  • idf_ : array of shape (n_features) The inverse document frequency (IDF) vector; only defined if use_idf is True.

    .. versionadded:: 0.20


>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.pipeline import Pipeline
>>> import numpy as np
>>> corpus = ['this is the first document',
...           'this document is the second document',
...           'and this is the third one',
...           'is this the first document']
>>> vocabulary = ['this', 'document', 'first', 'is', 'second', 'the',
...               'and', 'one']
>>> pipe = Pipeline([('count', CountVectorizer(vocabulary=vocabulary)),
...                  ('tfid', TfidfTransformer())]).fit(corpus)
>>> pipe['count'].transform(corpus).toarray()
array([[1, 1, 1, 1, 0, 1, 0, 0],
       [1, 2, 0, 1, 1, 1, 0, 0],
       [1, 0, 0, 1, 0, 1, 1, 1],
       [1, 1, 1, 1, 0, 1, 0, 0]])
>>> pipe['tfid'].idf_
array([1.        , 1.22314355, 1.51082562, 1.        , 1.91629073,
       1.        , 1.91629073, 1.91629073])
>>> pipe.transform(corpus).shape
(4, 8)


.. [Yates2011] R. Baeza-Yates and B. Ribeiro-Neto (2011). Modern Information Retrieval. Addison Wesley, pp. 68-74.

.. [MRS2008] C.D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 118-120.


method fit
val fit :
  ?y:Py.Object.t ->
  x:Py.Object.t ->
  [> tag] Obj.t ->

Learn the idf vector (global term weights).


  • X : sparse matrix of shape (n_samples, n_features) A matrix of term/token counts.


method fit_transform
val fit_transform :
  ?y:[>`ArrayLike] Np.Obj.t ->
  ?fit_params:(string * Py.Object.t) list ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.


  • X : {array-like, sparse matrix, dataframe} of shape (n_samples, n_features)

  • y : ndarray of shape (n_samples,), default=None Target values.

  • **fit_params : dict Additional fit parameters.


  • X_new : ndarray of shape (n_samples, n_features_new) Transformed array.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.


method transform
val transform :
  ?copy:bool ->
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform a count matrix to a tf or tf-idf representation.


  • X : sparse matrix of shape (n_samples, n_features) A matrix of term/token counts.

  • copy : bool, default=True Whether to copy X and operate on the copy or perform in-place operations.


  • vectors : sparse matrix of shape (n_samples, n_features)


attribute idf_
val idf_ : t -> [>`ArrayLike] Np.Obj.t
val idf_opt : t -> ([>`ArrayLike] Np.Obj.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


Module Sklearn.Feature_extraction.Text.TfidfVectorizer wraps Python class sklearn.feature_extraction.text.TfidfVectorizer.

type t


constructor and attributes create
val create :
  ?input:[`Filename | `File | `Content] ->
  ?encoding:string ->
  ?decode_error:[`Strict | `Ignore | `Replace] ->
  ?strip_accents:[`Ascii | `Unicode] ->
  ?lowercase:bool ->
  ?preprocessor:Py.Object.t ->
  ?tokenizer:Py.Object.t ->
  ?analyzer:[`Char_wb | `Callable of Py.Object.t | `Word | `Char] ->
  ?stop_words:[`Arr of [>`ArrayLike] Np.Obj.t | `English] ->
  ?token_pattern:string ->
  ?ngram_range:Py.Object.t ->
  ?max_df:[`F of float | `I of int] ->
  ?min_df:[`F of float | `I of int] ->
  ?max_features:int ->
  ?vocabulary:[`Arr of [>`ArrayLike] Np.Obj.t | `Mapping of Py.Object.t] ->
  ?binary:bool ->
  ?dtype:Np.Dtype.t ->
  ?norm:[`L1 | `L2] ->
  ?use_idf:bool ->
  ?smooth_idf:bool ->
  ?sublinear_tf:bool ->
  unit ->

Convert a collection of raw documents to a matrix of TF-IDF features.

Equivalent to :class:CountVectorizer followed by :class:TfidfTransformer.

Read more in the :ref:User Guide <text_feature_extraction>.


  • input : {'filename', 'file', 'content'}, default='content' If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need reading to fetch the raw content to analyze.

    If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

    Otherwise the input is expected to be a sequence of items that can be of type string or byte.

  • encoding : str, default='utf-8' If bytes or files are given to analyze, this encoding is used to decode.

  • decode_error : {'strict', 'ignore', 'replace'}, default='strict' Instruction on what to do if a byte sequence is given to analyze that contains characters not of the given encoding. By default, it is 'strict', meaning that a UnicodeDecodeError will be raised. Other values are 'ignore' and 'replace'.

  • strip_accents : {'ascii', 'unicode'}, default=None Remove accents and perform other character normalization during the preprocessing step. 'ascii' is a fast method that only works on characters that have a direct ASCII mapping. 'unicode' is a slightly slower method that works on any characters. None (default) does nothing.

    Both 'ascii' and 'unicode' use NFKD normalization from :func:unicodedata.normalize.

  • lowercase : bool, default=True Convert all characters to lowercase before tokenizing.

  • preprocessor : callable, default=None Override the preprocessing (string transformation) stage while preserving the tokenizing and n-grams generation steps. Only applies if analyzer is not callable.

  • tokenizer : callable, default=None Override the string tokenization step while preserving the preprocessing and n-grams generation steps. Only applies if analyzer == 'word'.

  • analyzer : {'word', 'char', 'char_wb'} or callable, default='word' Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

    If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

    .. versionchanged:: 0.21

    Since v0.21, if input is filename or file, the data is first read from the file and then passed to the given callable analyzer.

  • stop_words : {'english'}, list, default=None If a string, it is passed to _check_stop_list and the appropriate stop list is returned. 'english' is currently the only supported string value. There are several known issues with 'english' and you should consider an alternative (see :ref:stop_words).

    If a list, that list is assumed to contain stop words, all of which will be removed from the resulting tokens. Only applies if analyzer == 'word'.

    If None, no stop words will be used. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.

  • token_pattern : str Regular expression denoting what constitutes a 'token', only used if analyzer == 'word'. The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

  • ngram_range : tuple (min_n, max_n), default=(1, 1) The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.

  • max_df : float or int, default=1.0 When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). A float in the range [0.0, 1.0] represents a proportion of documents; an integer represents an absolute count. This parameter is ignored if vocabulary is not None.

  • min_df : float or int, default=1 When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. A float in the range [0.0, 1.0] represents a proportion of documents; an integer represents an absolute count. This parameter is ignored if vocabulary is not None.

  • max_features : int, default=None If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

    This parameter is ignored if vocabulary is not None.

  • vocabulary : Mapping or iterable, default=None Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.

  • binary : bool, default=False If True, all non-zero term counts are set to 1. This does not mean outputs will have only 0/1 values, only that the tf term in tf-idf is binary. (Set idf and normalization to False to get 0/1 outputs).

  • dtype : dtype, default=float64 Type of the matrix returned by fit_transform() or transform().

  • norm : {'l1', 'l2'}, default='l2' Each output row will have unit norm, either:

    • 'l2': Sum of squares of vector elements is 1. The cosine similarity between two vectors is their dot product when l2 norm has been applied.
    • 'l1': Sum of absolute values of vector elements is 1.
    See :func:preprocessing.normalize.

  • use_idf : bool, default=True Enable inverse-document-frequency reweighting.

  • smooth_idf : bool, default=True Smooth idf weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once. Prevents zero divisions.

  • sublinear_tf : bool, default=False Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).
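
The interaction of use_idf, smooth_idf, sublinear_tf and norm can be illustrated with a small pure-Python sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 described in the scikit-learn user guide; the helper name tfidf_row is hypothetical and this is not the library's implementation.

```python
import math

def tfidf_row(counts, df, n_docs, sublinear_tf=False, smooth_idf=True, norm="l2"):
    """Weight one document's term counts (hypothetical helper, illustration only).

    counts: term counts for one document; df: document frequency of each term.
    """
    weights = []
    for tf, d in zip(counts, df):
        if sublinear_tf and tf > 0:
            tf = 1.0 + math.log(tf)  # sublinear tf scaling: 1 + log(tf)
        if smooth_idf:
            # Assumed smoothed idf: acts as if one extra document held every term.
            idf = math.log((1 + n_docs) / (1 + d)) + 1.0
        else:
            idf = math.log(n_docs / d) + 1.0  # fails if d == 0, hence smoothing
        weights.append(tf * idf)
    if norm == "l2":
        scale = math.sqrt(sum(w * w for w in weights))
    elif norm == "l1":
        scale = sum(abs(w) for w in weights)
    else:
        scale = 1.0  # no normalization
    return [w / scale for w in weights] if scale else weights

row = tfidf_row(counts=[1, 2, 0], df=[4, 3, 1], n_docs=4)
row_l1 = tfidf_row(counts=[1, 2, 0], df=[4, 3, 1], n_docs=4, norm="l1")
```

With norm='l2' the squared entries of row sum to 1, and with norm='l1' the absolute values of row_l1 sum to 1.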


  • vocabulary_ : dict A mapping of terms to feature indices.

  • fixed_vocabulary_ : bool True if a fixed vocabulary of term-to-index mappings is provided by the user.

  • idf_ : array of shape (n_features,) The inverse document frequency (IDF) vector; only defined if use_idf is True.

  • stop_words_ : set Terms that were ignored because they either:

    • occurred in too many documents (max_df)
    • occurred in too few documents (min_df)
    • were cut off by feature selection (max_features).

    This is only available if no vocabulary was given.
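
How min_df and max_df decide which terms end up in stop_words_ can be sketched as follows. This is a simplified illustration on pre-tokenized documents; the function name and the toy data are hypothetical, and max_features is not modelled.

```python
from collections import Counter

def filter_by_document_frequency(tokenized_docs, min_df=1, max_df=1.0):
    """Split terms into kept and removed sets (hypothetical helper)."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    # Floats are proportions of documents, integers are absolute counts.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    kept = sorted(t for t, c in df.items() if low <= c <= high)
    removed = {t for t, c in df.items() if not (low <= c <= high)}
    return kept, removed

docs = [["the", "cat"], ["the", "dog"], ["the", "cat", "sat"]]
kept, removed = filter_by_document_frequency(docs, min_df=2, max_df=0.9)
```

Here 'the' appears in all three documents and exceeds max_df, while 'dog' and 'sat' fall below min_df, so only 'cat' is kept.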

See Also

  • CountVectorizer : Transforms text into a sparse matrix of n-gram counts.

  • TfidfTransformer : Performs the TF-IDF transformation from a provided matrix of counts.


The stop_words_ attribute can get large and increase the model size when pickling. This attribute is provided only for introspection and can be safely removed using delattr or set to None before pickling.


>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third one.',
...     'Is this the first document?',
... ]
>>> vectorizer = TfidfVectorizer()
>>> X = vectorizer.fit_transform(corpus)
>>> print(vectorizer.get_feature_names())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
>>> print(X.shape)
(4, 9)


method build_analyzer
val build_analyzer :
  [> tag] Obj.t ->

Return a callable that handles preprocessing, tokenization and n-grams generation.


  • analyzer: callable A function to handle preprocessing, tokenization and n-grams generation.
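
A rough idea of what such a callable does for analyzer == 'word' is sketched below. It assumes lowercasing as the only preprocessing step and a token pattern matching 2 or more word characters; make_analyzer is a hypothetical name, not part of the bindings.

```python
import re

# Assumed default-like pattern: tokens of 2 or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def make_analyzer(ngram_range=(1, 1)):
    """Build a word analyzer (hypothetical helper, not the library's code)."""
    min_n, max_n = ngram_range

    def analyze(doc):
        # Preprocess (lowercase) and tokenize.
        tokens = TOKEN_PATTERN.findall(doc.lower())
        # Generate every n-gram with min_n <= n <= max_n.
        grams = []
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                grams.append(" ".join(tokens[i:i + n]))
        return grams

    return analyze

analyze = make_analyzer(ngram_range=(1, 2))
features = analyze("This is the first document.")
```

With ngram_range=(1, 2) the result holds the five unigrams followed by the four bigrams of the sentence.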


method build_preprocessor
val build_preprocessor :
  [> tag] Obj.t ->

Return a function to preprocess the text before tokenization.


  • preprocessor: callable A function to preprocess the text before tokenization.


method build_tokenizer
val build_tokenizer :
  [> tag] Obj.t ->

Return a function that splits a string into a sequence of tokens.


  • tokenizer: callable A function to split a string into a sequence of tokens.


method decode
val decode :
  doc:string ->
  [> tag] Obj.t ->

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.


  • doc : str The string to decode.


  • doc: str A string of unicode symbols.


method fit
val fit :
  ?y:Py.Object.t ->
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Learn vocabulary and idf from training set.


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.

  • y : None This parameter is not needed to compute tfidf.


  • self : object Fitted vectorizer.


method fit_transform
val fit_transform :
  ?y:Py.Object.t ->
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Learn vocabulary and idf, return document-term matrix.

This is equivalent to fit followed by transform, but more efficiently implemented.


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.

  • y : None This parameter is ignored.


  • X : sparse matrix of (n_samples, n_features) Tf-idf-weighted document-term matrix.


method get_feature_names
val get_feature_names :
  [> tag] Obj.t ->
  string list

Array mapping from feature integer indices to feature name.


  • feature_names : list A list of feature names.


method get_params
val get_params :
  ?deep:bool ->
  [> tag] Obj.t ->

Get parameters for this estimator.


  • deep : bool, default=True If True, will return the parameters for this estimator and contained subobjects that are estimators.


  • params : mapping of string to any Parameter names mapped to their values.


method get_stop_words
val get_stop_words :
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t option

Build or fetch the effective stop words list.


  • stop_words: list or None A list of stop words.


method inverse_transform
val inverse_transform :
  x:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->

Return terms per document with nonzero entries in X.


  • X : {array-like, sparse matrix} of shape (n_samples, n_features) Document-term matrix.


  • X_inv : list of arrays of shape (n_samples,) List of arrays of terms.
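
The idea can be sketched on a dense list-of-lists matrix. The function below is a hypothetical stand-in that ignores sparse storage entirely.

```python
def terms_per_document(rows, feature_names):
    """Return, for each row, the feature names whose entries are nonzero."""
    return [
        [feature_names[j] for j, value in enumerate(row) if value != 0]
        for row in rows
    ]

feature_names = ["and", "document", "first"]
X = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]
terms = terms_per_document(X, feature_names)
```

The weights themselves are discarded; only the presence of a term in each document is recovered.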


method set_params
val set_params :
  ?params:(string * Py.Object.t) list ->
  [> tag] Obj.t ->

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.


  • **params : dict Estimator parameters.


  • self : object Estimator instance.
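
The <component>__<parameter> convention can be illustrated by splitting a key at its first double underscore. This is a minimal sketch with a hypothetical helper name; the key 'tfidf__ngram_range' is only an example.

```python
def split_param_key(key):
    """Split '<component>__<parameter>' into (component, parameter)."""
    component, sep, parameter = key.partition("__")
    if not sep:
        # No double underscore: a parameter of the estimator itself.
        return None, key
    return component, parameter

nested = split_param_key("tfidf__ngram_range")
plain = split_param_key("norm")
```

Deeper nesting keeps the remainder intact, so 'a__b__c' yields ('a', 'b__c') and the inner component can split it again.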


method transform
val transform :
  ?copy:bool ->
  raw_documents:[>`ArrayLike] Np.Obj.t ->
  [> tag] Obj.t ->
  [>`ArrayLike] Np.Obj.t

Transform documents to document-term matrix.

Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).


  • raw_documents : iterable An iterable which yields either str, unicode or file objects.

  • copy : bool, default=True Whether to copy X and operate on the copy or perform in-place operations.

    .. deprecated:: 0.22 The copy parameter is unused and was deprecated in version 0.22 and will be removed in 0.24. This parameter will be ignored.


  • X : sparse matrix of (n_samples, n_features) Tf-idf-weighted document-term matrix.


attribute vocabulary_
val vocabulary_ : t -> Dict.t
val vocabulary_opt : t -> (Dict.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute fixed_vocabulary_
val fixed_vocabulary_ : t -> bool
val fixed_vocabulary_opt : t -> (bool) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute idf_
val idf_ : t -> [>`ArrayLike] Np.Obj.t
val idf_opt : t -> ([>`ArrayLike] Np.Obj.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


attribute stop_words_
val stop_words_ : t -> Py.Object.t
val stop_words_opt : t -> (Py.Object.t) option

This attribute is documented in create above. The first version raises Not_found if the attribute is None. The _opt version returns an option.


Module Sklearn.​Feature_extraction.​Text.​Itemgetter wraps Python class sklearn.feature_extraction.text.itemgetter.

type t


Module Sklearn.​Feature_extraction.​Text.​Partial wraps Python class sklearn.feature_extraction.text.partial.

type t


constructor and attributes create
val create :
  ?keywords:(string * Py.Object.t) list ->
  func:Py.Object.t ->
  Py.Object.t list ->

partial(func, *args, **keywords) - new function with partial application of the given arguments and keywords.
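
On the Python side this behaves like functools.partial, which the wrapped class appears to re-export. A short sketch with a hypothetical scale function:

```python
from functools import partial

def scale(value, factor, offset=0):
    """Toy function used only for this illustration."""
    return value * factor + offset

# Freeze the keyword arguments; the remaining argument is supplied later.
double_plus_one = partial(scale, factor=2, offset=1)
result = double_plus_one(5)

# Positional arguments are frozen from the left.
two_to_the = partial(pow, 2)
power = two_to_the(10)
```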


function check_array
val check_array :
  ?accept_sparse:[`S of string | `StringList of string list | `Bool of bool] ->
  ?accept_large_sparse:bool ->
  ?dtype:[`Dtypes of Np.Dtype.t list | `S of string | `Dtype of Np.Dtype.t | `None] ->
  ?order:[`F | `C] ->
  ?copy:bool ->
  ?force_all_finite:[`Allow_nan | `Bool of bool] ->
  ?ensure_2d:bool ->
  ?allow_nd:bool ->
  ?ensure_min_samples:int ->
  ?ensure_min_features:int ->
  ?estimator:[>`BaseEstimator] Np.Obj.t ->
  array:Py.Object.t ->
  unit ->

Input validation on an array, list, sparse matrix or similar.

By default, the input is checked to be a non-empty 2D array containing only finite values. If the dtype of the array is object, attempt converting to float, raising on failure.


  • array : object Input object to check / convert.

  • accept_sparse : string, boolean or list/tuple of strings (default=False) String[s] representing allowed sparse matrix formats, such as 'csc', 'csr', etc. If the input is sparse but not in the allowed format, it will be converted to the first listed format. True allows the input to be any format. False means that a sparse matrix input will raise an error.

  • accept_large_sparse : bool (default=True) If a CSR, CSC, COO or BSR sparse matrix is supplied and accepted by accept_sparse, accept_large_sparse=False will cause it to be accepted only if its indices are stored with a 32-bit dtype.

    .. versionadded:: 0.20

  • dtype : string, type, list of types or None (default='numeric') Data type of result. If None, the dtype of the input is preserved. If 'numeric', dtype is preserved unless array.dtype is object. If dtype is a list of types, conversion on the first type is only performed if the dtype of the input is not in the list.

  • order : 'F', 'C' or None (default=None) Whether an array will be forced to be fortran or c-style. When order is None (default), then if copy=False, nothing is ensured about the memory layout of the output array; otherwise (copy=True) the memory layout of the returned array is kept as close as possible to the original array.

  • copy : boolean (default=False) Whether a forced copy will be triggered. If copy=False, a copy might be triggered by a conversion.

  • force_all_finite : boolean or 'allow-nan', (default=True) Whether to raise an error on np.inf, np.nan, pd.NA in array. The possibilities are:

    • True: Force all values of array to be finite.
    • False: accepts np.inf, np.nan, pd.NA in array.
    • 'allow-nan': accepts only np.nan and pd.NA values in array. Values cannot be infinite.

    .. versionadded:: 0.20 force_all_finite accepts the string 'allow-nan'.

    .. versionchanged:: 0.23 Accepts pd.NA and converts it into np.nan

  • ensure_2d : boolean (default=True) Whether to raise a value error if array is not 2D.

  • allow_nd : boolean (default=False) Whether to allow array.ndim > 2.

  • ensure_min_samples : int (default=1) Make sure that the array has a minimum number of samples in its first axis (rows for a 2D array). Setting to 0 disables this check.

  • ensure_min_features : int (default=1) Make sure that the 2D array has some minimum number of features (columns). The default value of 1 rejects empty datasets. This check is only enforced when the input data has effectively 2 dimensions or is originally 1D and ensure_2d is True. Setting to 0 disables this check.

  • estimator : str or estimator instance (default=None) If passed, include the name of the estimator in warning messages.


  • array_converted : object The converted and validated array.


function check_is_fitted
val check_is_fitted :
  ?attributes:[`Arr of [>`ArrayLike] Np.Obj.t | `S of string | `StringList of string list] ->
  ?msg:string ->
  ?all_or_any:[`Callable of Py.Object.t | `PyObject of Py.Object.t] ->
  estimator:[>`BaseEstimator] Np.Obj.t ->
  unit ->

Perform is_fitted validation for estimator.

Checks if the estimator is fitted by verifying the presence of fitted attributes (ending with a trailing underscore) and otherwise raises a NotFittedError with the given message.

This utility is meant to be used internally by estimators themselves, typically in their own predict / transform methods.


  • estimator : estimator instance Estimator instance for which the check is performed.

  • attributes : str, list or tuple of str, default=None Attribute name(s) given as a string or a list/tuple of strings, e.g. ['coef_', 'estimator_', ...] or 'coef_'.

    If None, the estimator is considered fitted if there exists an attribute that ends with an underscore and does not start with a double underscore.

  • msg : string The default error message is, 'This %(name)s instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.'

    For custom messages, if '%(name)s' is present in the message string, it is replaced by the estimator name, e.g. 'Estimator, %(name)s, must be fitted before sparsifying'.

  • all_or_any : callable, {all, any}, default all Specify whether all or any of the given attributes must exist.




NotFittedError If the attributes are not found.
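
The default rule (attributes is None) can be sketched in plain Python. The exception class, helper names and ToyEstimator below are stand-ins introduced for this illustration, not the library's objects.

```python
class NotFittedError(ValueError):
    """Stand-in for the library's exception (assumption for this sketch)."""

def looks_fitted(estimator):
    # Fitted if any instance attribute ends with '_' and does not start with '__'.
    return any(
        name.endswith("_") and not name.startswith("__")
        for name in vars(estimator)
    )

def check_fitted(estimator, msg=None):
    if msg is None:
        msg = ("This %(name)s instance is not fitted yet. Call 'fit' with "
               "appropriate arguments before using this estimator.")
    if not looks_fitted(estimator):
        # '%(name)s' in the message is replaced by the estimator's class name.
        raise NotFittedError(msg % {"name": type(estimator).__name__})

class ToyEstimator:
    def fit(self):
        self.vocabulary_ = {"ham": 0}  # fitted attribute with trailing underscore
        return self

try:
    check_fitted(ToyEstimator())
    message = None
except NotFittedError as exc:
    message = str(exc)

check_fitted(ToyEstimator().fit())  # passes silently once fitted
```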


function normalize
val normalize :
  ?norm:[`L1 | `L2 | `Max] ->
  ?axis:[`Zero | `One] ->
  ?copy:bool ->
  ?return_norm:bool ->
  x:[>`ArrayLike] Np.Obj.t ->
  unit ->
  ([>`ArrayLike] Np.Obj.t * [>`ArrayLike] Np.Obj.t)

Scale input vectors individually to unit norm (vector length).

Read more in the :ref:User Guide <preprocessing_normalization>.


  • X : {array-like, sparse matrix}, shape [n_samples, n_features] The data to normalize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.

  • norm : 'l1', 'l2', or 'max', optional ('l2' by default) The norm to use to normalize each non zero sample (or each non-zero feature if axis is 0).

  • axis : 0 or 1, optional (1 by default) Axis along which to normalize the data. If 1, independently normalize each sample, otherwise (if 0) normalize each feature.

  • copy : boolean, optional, default True Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix and if axis is 1).

  • return_norm : boolean, default False Whether to return the computed norms.


  • X : {array-like, sparse matrix}, shape [n_samples, n_features] Normalized input X.

  • norms : array, shape [n_samples] if axis=1 else [n_features] An array of norms along given axis for X. When X is sparse, a NotImplementedError will be raised for norm 'l1' or 'l2'.

See also

  • Normalizer: Performs normalization using the Transformer API (e.g. as part of a preprocessing :class:sklearn.pipeline.Pipeline).


For a comparison of the different scalers, transformers, and normalizers, see :ref:examples/preprocessing/.
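
Row-wise normalization (axis=1) for the 'l1' and 'l2' norms can be sketched without numpy. The function name is hypothetical, the 'max' norm is left out, and leaving all-zero rows unchanged is an assumption of this sketch.

```python
import math

def normalize_rows(rows, norm="l2"):
    """Scale each row to unit norm (hypothetical helper, dense lists only)."""
    out = []
    for row in rows:
        if norm == "l1":
            scale = sum(abs(v) for v in row)
        elif norm == "l2":
            scale = math.sqrt(sum(v * v for v in row))
        else:
            raise ValueError("this sketch only supports 'l1' and 'l2'")
        scale = scale or 1.0  # assumption: all-zero rows are returned as-is
        out.append([v / scale for v in row])
    return out

unit_l2 = normalize_rows([[3.0, 4.0], [0.0, 0.0]])
unit_l1 = normalize_rows([[1.0, 3.0]], norm="l1")
```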


function strip_accents_ascii
val strip_accents_ascii :
  string ->

Transform accented unicode symbols into ASCII or nothing.

Warning: this solution is only suited for languages that have a direct transliteration to ASCII symbols.


  • s : string The string to strip

See Also

  • strip_accents_unicode : Removes accents from any unicode symbol.
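
One common way to get this behaviour uses the standard unicodedata module: decompose the string, then drop everything that is not ASCII. The sketch below assumes NFKD normalization and uses a hypothetical function name.

```python
import unicodedata

def ascii_fold(s):
    """Decompose accented characters, then drop anything outside ASCII."""
    decomposed = unicodedata.normalize("NFKD", s)
    return decomposed.encode("ASCII", "ignore").decode("ASCII")

folded = ascii_fold("caf\u00e9 d\u00e9j\u00e0 vu")   # 'café déjà vu'
dropped = ascii_fold("s\u00f8ren")                   # 'søren'
```

The second call shows the "or nothing" case: 'ø' has no decomposition into an ASCII letter plus a combining mark, so it is removed entirely.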


function strip_accents_unicode
val strip_accents_unicode :
  string ->

Transform accented unicode symbols into their simple counterpart.

Warning: the python-level loop and join operations make this implementation 20 times slower than the strip_accents_ascii basic normalization.


  • s : string The string to strip

See Also

  • strip_accents_ascii : Removes accents from any unicode symbol that has a direct ASCII equivalent.
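
A sketch of the loop-and-join approach mentioned in the warning, assuming NFKD normalization followed by removal of combining marks. The function name is hypothetical.

```python
import unicodedata

def strip_combining_marks(s):
    """Decompose characters and drop the combining marks, keeping base letters."""
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

latin = strip_combining_marks("caf\u00e9")     # 'café'
greek = strip_combining_marks("\u03ac")        # Greek alpha with tonos
kept = strip_combining_marks("s\u00f8ren")     # 'søren'
```

Unlike an ASCII-only approach, non-ASCII base letters survive: the Greek alpha loses only its accent, and 'ø' is left untouched.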


function strip_tags
val strip_tags :
  string ->

Basic regexp based HTML / XML tag stripper function

For serious HTML/XML preprocessing you should rather use an external library such as lxml or BeautifulSoup.


  • s : string The string to strip
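
A regexp-based stripper of this kind can be sketched as below. Replacing each tag with a single space is an assumption of this sketch, and the function name is hypothetical.

```python
import re

TAG = re.compile(r"<([^>]+)>")  # anything between '<' and the next '>'

def strip_tags_sketch(s):
    """Replace every tag-like span with a space (illustration only)."""
    return TAG.sub(" ", s)

text = strip_tags_sketch("<p>Hello <b>world</b></p>")
words = text.split()
```

Such a pattern does not understand comments, CDATA sections or attribute values containing '>', which is why a real parser is preferable for serious preprocessing.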


function grid_to_graph
val grid_to_graph :
  ?n_z:int ->
  ?mask:[`Arr of [>`ArrayLike] Np.Obj.t | `Dtype_bool of Py.Object.t] ->
  ?return_as:Py.Object.t ->
  ?dtype:Np.Dtype.t ->
  n_x:int ->
  n_y:int ->
  unit ->

Graph of the pixel-to-pixel connections

Edges exist if 2 voxels are connected.


  • n_x : int Dimension in x axis

  • n_y : int Dimension in y axis

  • n_z : int, default=1 Dimension in z axis

  • mask : ndarray of shape (n_x, n_y, n_z), dtype=bool, default=None An optional mask of the image, to consider only part of the pixels.

  • return_as : np.ndarray or a sparse matrix class, default=sparse.coo_matrix The class to use to build the returned adjacency matrix.

  • dtype : dtype, default=int The data type of the returned sparse matrix. By default it is int.


For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix instance. Going forward, np.ndarray returns an np.ndarray, as expected.

For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
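
The connectivity being described can be sketched for the 2D case (n_z = 1) as a plain edge list. Row-major pixel numbering and the helper name are assumptions of this sketch, and it returns undirected edges rather than an adjacency matrix.

```python
def grid_edges(n_x, n_y):
    """List the edges between directly neighbouring pixels of an n_x by n_y grid."""
    def index(x, y):
        return x * n_y + y  # assumed row-major numbering of pixels

    edges = []
    for x in range(n_x):
        for y in range(n_y):
            if y + 1 < n_y:
                edges.append((index(x, y), index(x, y + 1)))  # neighbour along y
            if x + 1 < n_x:
                edges.append((index(x, y), index(x + 1, y)))  # neighbour along x
    return edges

edges = sorted(grid_edges(2, 2))
```

An n_x by n_y grid has n_x * (n_y - 1) + (n_x - 1) * n_y such edges.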


function img_to_graph
val img_to_graph :
  ?mask:[`Arr of [>`ArrayLike] Np.Obj.t | `Dtype_bool of Py.Object.t] ->
  ?return_as:Py.Object.t ->
  ?dtype:Np.Dtype.t ->
  img:[>`ArrayLike] Np.Obj.t ->
  unit ->

Graph of the pixel-to-pixel gradient connections

Edges are weighted with the gradient values.

Read more in the :ref:User Guide <image_feature_extraction>.


  • img : ndarray of shape (height, width) or (height, width, channel) 2D or 3D image.

  • mask : ndarray of shape (height, width) or (height, width, channel), dtype=bool, default=None An optional mask of the image, to consider only part of the pixels.

  • return_as : np.ndarray or a sparse matrix class, default=sparse.coo_matrix The class to use to build the returned adjacency matrix.

  • dtype : dtype, default=None The data type of the returned sparse matrix. By default it is the dtype of img.


For scikit-learn versions 0.14.1 and prior, return_as=np.ndarray was handled by returning a dense np.matrix instance. Going forward, np.ndarray returns an np.ndarray, as expected.

For compatibility, user code relying on this method should wrap its calls in np.asarray to avoid type issues.
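
Gradient weighting can be sketched for a 2D image stored as nested lists. The sketch assumes the weight of an edge is the absolute intensity difference between the two neighbouring pixels and that pixels are numbered in row-major order; the helper name is hypothetical.

```python
def gradient_edges(img):
    """Map each neighbouring pixel pair to the absolute intensity difference."""
    height, width = len(img), len(img[0])
    edges = {}
    for r in range(height):
        for c in range(width):
            here = r * width + c  # assumed row-major pixel index
            if c + 1 < width:
                edges[(here, here + 1)] = abs(img[r][c] - img[r][c + 1])
            if r + 1 < height:
                edges[(here, here + width)] = abs(img[r][c] - img[r + 1][c])
    return edges

img = [[1, 4],
       [2, 8]]
weights = gradient_edges(img)
```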