data_manipulation package

Submodules

data_manipulation.base module

data_manipulation.base.clean_string(string: str, remove_parenthesis: bool = False, remove_brackets: bool = False) str[source]

Cleans and standardizes input string.

Parameters:
  • string (str) – String to clean.

  • remove_parenthesis (bool) – Whether to remove content within parentheses.

  • remove_brackets (bool) – Whether to remove content within square brackets.

Returns:

Uppercase cleaned string with standardized spacing.

Return type:

str

Raises:

ValueError – If input string is None or empty.

Examples

>>> clean_string(" sHawn  tesT ")
'SHAWN TEST'
>>> clean_string("shawn ( te  st )")
'SHAWN ( TE ST )'
>>> clean_string("shawn ( te  st )", remove_parenthesis=True)
'SHAWN'
>>> clean_string("shawn [ te  st ]", remove_brackets=True)
'SHAWN'
data_manipulation.base.create_encode_url(url: str, query_params: Dict[str, Any] | None = None) str[source]

Creates an encoded URL with query parameters.

Parameters:
  • url (str) – Base URL.

  • query_params (dict, optional) – Dictionary of URL query parameters. Defaults to None.

Returns:

Encoded URL with query parameters.

Return type:

str

Raises:

ValueError – If URL is invalid.
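
Examples

A minimal usage sketch; the exact encoding of the output (e.g. how spaces are escaped) is an assumption:

>>> create_encode_url("https://example.com/api", {"q": "a b", "page": 1})
'https://example.com/api?q=a+b&page=1'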

data_manipulation.base.delete_list_indices(list_: list, indices: List[int]) None[source]

Deletes multiple indices from a list in-place.

Parameters:
  • list_ (list) – Original list to modify.

  • indices (list) – List of indices to delete.

Examples

>>> values = [0, 1, 2, 3, 4]
>>> delete_list_indices(values, [1, 3])
>>> values
[0, 2, 4]
data_manipulation.base.deprecated(func)[source]

Decorator to mark functions as deprecated.
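
Examples

A minimal sketch; the exact warning text emitted is an assumption:

>>> @deprecated
... def old_function():
...     return 1
>>> old_function()  # emits a DeprecationWarning before returning the result
1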

data_manipulation.base.get_country_code_variation() dict[source]

Returns a dictionary mapping country codes to their name variations.

Returns:

Dictionary with country codes as keys and lists of country name variations as values.

Return type:

dict
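
Examples

An illustrative sketch based on the documented return shape; the specific contents are assumptions:

>>> variations = get_country_code_variation()
>>> isinstance(variations, dict)
True
>>> all(isinstance(names, list) for names in variations.values())
True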

data_manipulation.base.get_country_name_variation() dict[source]

Returns a dictionary mapping country names to their codes.

Returns:

Dictionary with country names as keys and country codes as values.

Return type:

dict
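
Examples

An illustrative sketch based on the documented return shape:

>>> names = get_country_name_variation()
>>> isinstance(names, dict)
True
>>> all(isinstance(code, str) for code in names.values())
True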

data_manipulation.base.get_none_variation() List[None | str][source]

Returns a list of common variations of None/null values.

Returns:

List containing None and various string representations of null values.

Return type:

list

Examples

>>> get_none_variation()
[None, 'NONE', 'NONe', 'NOnE', 'NOne', 'NoNE', 'NoNe', 'NonE', 'None', 'nONE', 'nONe', 'nOnE', 'nOne', 'noNE', 'noNe', 'nonE', 'none', 'NULL', 'NULl', 'NUlL', 'NUll', 'NuLL', 'NuLl', 'NulL', 'Null', 'nULL', 'nULl', 'nUlL', 'nUll', 'nuLL', 'nuLl', 'nulL', 'null', 'NA', 'Na', 'nA', 'na', 'N.A', 'N.a', 'N.A', 'N.a', 'n.A', 'n.a', 'n.A', 'n.a', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'NAN', 'NAn', 'NaN', 'Nan', 'nAN', 'nAn', 'naN', 'nan', 'NIL', 'NIl', 'NiL', 'Nil', 'nIL', 'nIl', 'niL', 'nil']
data_manipulation.base.get_path_files(path: str | Path, keywords: List[str]) List[str][source]

Returns sorted list of files from given path that contain specified keywords.

Parameters:
  • path (str) – Directory path to search.

  • keywords (list) – List of keywords to match in filenames.

Returns:

Sorted list of matching filenames.

Return type:

List[str]

Examples

>>> get_path_files("test_base_folder", ["py"])
['test1.py', 'test2.py', 'test3.py', 'test4.py', 'test5.py']
data_manipulation.base.get_string_case_combination(str_: str) List[str][source]

Generates all possible case combinations of a string.

Parameters:

str_ (str) – Input string to generate combinations for.

Returns:

List of all possible case combinations.

Return type:

List[str]

Examples

>>> get_string_case_combination("abc")
['ABC', 'ABc', 'AbC', 'Abc', 'aBC', 'aBc', 'abC', 'abc']
data_manipulation.base.list_to_file(filepath: str | Path, list_: List, newline: bool = True) None[source]

Writes list contents to a file.

Parameters:
  • filepath (str) – Path to output file.

  • list_ (list) – List of values to write.

  • newline (bool, optional) – Whether to add newline after each item. Defaults to True.

Examples

>>> list_to_file("test.txt", [1, 2, 3])
data_manipulation.base.list_tuple_without_none(list_tuple: List | Tuple) List | Tuple[source]

Removes None variations from a list or tuple.

Parameters:

list_tuple (Union[List, Tuple]) – Input list or tuple to clean.

Returns:

Cleaned list or tuple.

Return type:

Union[List, Tuple]

Raises:

TypeError – If input is not a list or tuple.

Examples

>>> list_tuple_without_none(["a", "none"])
['a']
>>> list_tuple_without_none(("a", "none"))
('a',)
data_manipulation.base.parse_ps_aux(ps_aux_commands: str) List[List[str]][source]

Parses Linux ps aux command output into a list of records.

Parameters:

ps_aux_commands (str) – Linux ps aux command string.

Returns:

List of process records.

Return type:

List[List[str]]

Examples

>>> # parse_ps_aux("ps aux | egrep -i '%cpu|anaconda3' | head")
data_manipulation.base.remove_path_file(path: str | Path, keyword: str, n: int = 2) None[source]

Removes all but the n newest files matching the keyword.

Parameters:
  • path (str) – Directory path.

  • keyword (str) – Keyword to match in filenames.

  • n (int, optional) – Number of newest files to keep. Defaults to 2.

Examples

>>> remove_path_file("test_base_folder", ".py")
data_manipulation.base.string_boolean_to_int(boolean_str_rep: str) int[source]

Converts string boolean representations to integers.

Parameters:

boolean_str_rep (str) – String representation of boolean value.

Returns:

1 for true values, 0 for false values.

Return type:

int

Examples

>>> string_boolean_to_int("true")
1
>>> string_boolean_to_int("True")
1
data_manipulation.base.string_dlt_to_dlt(dlt_str_rep: str) Dict | List | Tuple[source]

Converts string representation of data structures to actual Python objects.

Parameters:

dlt_str_rep (str) – String representation of dictionary/list/tuple.

Returns:

Converted Python data structure.

Return type:

Union[Dict, List, Tuple]

Examples

>>> string_dlt_to_dlt("[1, 2, 3]")
[1, 2, 3]
>>> string_dlt_to_dlt("{'a': 1, 'b': 2}")
{'a': 1, 'b': 2}
>>> string_dlt_to_dlt("('1', '2', '3')")
('1', '2', '3')
data_manipulation.base.string_str_to_str(string_str_rep: str) str[source]

Converts string representation to a clean string by removing quotes.

Parameters:

string_str_rep (str) – String representation to clean.

Returns:

Cleaned string with outer quotes removed.

Return type:

str

Examples

>>> string_str_to_str("'test'")
'test'
>>> string_str_to_str('"test"')
'test'

data_manipulation.beautifulsoup_ module

data_manipulation.beautifulsoup_.build_soup(url: str, features: str = 'lxml', to_preprocess: bool = True, timeout: int = 10, headers: dict | None = None) BeautifulSoup | None[source]

Creates a BeautifulSoup object from a given URL.

Parameters:
  • url (str) – URL to fetch and parse.

  • features (str, optional) – Parser to use. Defaults to “lxml”.

  • to_preprocess (bool, optional) – Whether to preprocess the HTML. Defaults to True.

  • timeout (int, optional) – Request timeout in seconds. Defaults to 10.

  • headers (Optional[dict], optional) – Custom headers for the request. Defaults to None.

Returns:

Parsed BeautifulSoup object, or None if request fails.

Return type:

Optional[BeautifulSoup]

Examples

>>> a = build_soup("https://google.com")
>>> type(a)
<class 'bs4.BeautifulSoup'>

Note

Requires requests and beautifulsoup4 packages.

data_manipulation.beautifulsoup_.preprocess(html: str) str | None[source]

Removes redundant whitespace and newline characters from an HTML string.

Parameters:

html (str) – HTML string to be cleaned.

Returns:

Cleaned HTML string with normalized whitespace.

Return type:

Optional[str]

Examples

>>> a = "<html>   <p> Something </p>    </html> "
>>> preprocess(a)
'<html><p>Something</p></html>'

data_manipulation.boto3_ module

data_manipulation.boto3_.list_s3_bucket_files(bucket: str, to_dateframe: bool = False, prefix: str | None = None) List[str] | pd.DataFrame[source]

Lists all files in an S3 bucket.

Parameters:
  • bucket (str) – Name of the S3 bucket.

  • to_dateframe (bool, optional) – Whether to return results as pandas DataFrame. Defaults to False.

  • prefix (Optional[str], optional) – Filter results to files with this prefix. Defaults to None.

Returns:

List of file keys or DataFrame containing file keys.

If to_dateframe is True, returns DataFrame with ‘key’ column. If to_dateframe is False, returns list of file keys.

Return type:

Union[List[str], pd.DataFrame]

Examples

>>> files = list_s3_bucket_files('my-bucket')
>>> type(files)
<class 'list'>
>>> df = list_s3_bucket_files('my-bucket', to_dateframe=True)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
data_manipulation.boto3_.send_aws_ses_email(sender: str, recipient: List[str], subject: str, body_text: str, body_type: str, ses_client: Any, attachment: str | None = None, max_attachment_size: int = 10485760, max_retries: int = 3) Dict[str, Any] | None[source]

Sends an email using AWS SES service.

Parameters:
  • sender (str) – Sender’s email address.

  • recipient (list) – List of recipient email addresses.

  • subject (str) – Email subject line.

  • body_text (str) – Email body content.

  • body_type (str) – MIME type of email body (e.g., ‘plain’, ‘html’).

  • ses_client – AWS SES client instance.

  • attachment (str, optional) – Path to file to attach. Defaults to None.

  • max_attachment_size (int, optional) – Maximum allowed attachment size in bytes. Defaults to 10MB.

  • max_retries (int, optional) – Maximum number of retry attempts for failed sends. Defaults to 3.

Returns:

AWS SES response dictionary if successful, None if failed.

Return type:

dict
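
Examples

A hedged sketch assuming a boto3 SES client and a verified sender address; all addresses and filenames are placeholders:

>>> import boto3
>>> ses = boto3.client("ses", region_name="us-east-1")
>>> response = send_aws_ses_email(
...     sender="noreply@example.com",
...     recipient=["user@example.com"],
...     subject="Daily report",
...     body_text="<p>Report attached.</p>",
...     body_type="html",
...     ses_client=ses,
...     attachment="report.pdf",
... )
>>> response is None or "MessageId" in response
True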

data_manipulation.cryptography_ module

data_manipulation.cryptography_.decrypt_fernet_data(keypath: str | Path, filepath: str | Path) bytes[source]

Decrypts a file using Fernet symmetric encryption.

Parameters:
  • keypath (str) – Path to the Fernet key file.

  • filepath (str) – Path to the encrypted file.

Returns:

Decrypted data.

Return type:

bytes

Raises:
  • FileNotFoundError – If key file or input file doesn’t exist

  • InvalidToken – If the key is invalid or data is corrupted

  • TypeError – If input types are incorrect
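
Examples

A sketch assuming the key and encrypted file were produced by generate_fernet_key() and encrypt_fernet_file(); file paths are illustrative:

>>> decrypted = decrypt_fernet_data('key.txt', 'data.txt')
>>> isinstance(decrypted, bytes)
True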

data_manipulation.cryptography_.encrypt_fernet_file(keypath: str | Path, filepath: str | Path) bytes[source]

Encrypts a file using Fernet symmetric encryption.

Parameters:
  • keypath (str) – Path to the Fernet key file.

  • filepath (str) – Path to the file to be encrypted.

Returns:

Encrypted data.

Return type:

bytes

Raises:

TypeError – If keypath or filepath are not strings.

Examples

>>> encrypted = encrypt_fernet_file('key.txt', 'data.txt')
>>> isinstance(encrypted, bytes)
True
data_manipulation.cryptography_.generate_fernet_key(output_directory: str | Path, output_filename: str) bytes[source]

Generates and saves a Fernet encryption key.

Parameters:
  • output_directory (str) – Directory path where the key file will be saved.

  • output_filename (str) – Name of the key file to be created.

Returns:

Generated Fernet key in bytes format.

Return type:

bytes

Raises:

Exception – If key generation or file writing fails.

Examples

>>> key = generate_fernet_key('/path/to/keys', 'encryption.key')
>>> isinstance(key, bytes)
True

data_manipulation.django_ module

data_manipulation.django_.django_validate_email(email: str, whitelist_domains: List[str] | None = None) str | None[source]

Validates an email address using Django’s validator.

Parameters:
  • email (str) – Email address to validate.

  • whitelist_domains (Optional[List[str]]) – List of allowed email domains.

Returns:

The validated email address if valid, None if invalid.

Return type:

Optional[str]

Examples

>>> django_validate_email("valid@email.com")
'valid@email.com'
>>> django_validate_email("valid@email.com", whitelist_domains=["email.com"])
'valid@email.com'
>>> print(django_validate_email("valid@email.com", whitelist_domains=["other.com"]))
None
data_manipulation.django_.django_validate_phone(phone: str, region: str | None = None) str | None[source]

Validates and formats a phone number using Django’s phone number field.

Parameters:
  • phone (str) – Phone number to validate.

  • region (Optional[str], optional) – Region code for parsing local numbers. Defaults to None.

Returns:

The phone number in E.164 format if valid, None if invalid.

Return type:

Optional[str]

Examples

>>> django_validate_phone("+1234567890")
'+1234567890'
>>> print(django_validate_phone("invalid"))
None

Note

Requires django-phonenumber-field package to be installed.

data_manipulation.django_.django_validate_url(url: str, allowed_schemes: List[str] | None = None) str | None[source]

Validates a URL using Django’s URL validator.

Parameters:
  • url (str) – URL to validate.

  • allowed_schemes (Optional[List[str]]) – List of allowed URL schemes (e.g., [‘http’, ‘https’]).

Returns:

The validated URL if valid, None if invalid.

Return type:

Optional[str]
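
Examples

Illustrative, mirroring the validators above; outputs assume Django’s standard URL validation:

>>> django_validate_url("https://example.com")
'https://example.com'
>>> print(django_validate_url("not-a-url"))
None
>>> print(django_validate_url("ftp://example.com", allowed_schemes=["http", "https"]))
None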

data_manipulation.django_.get_django_countries_dict() Tuple[Dict[str, str], Dict[str, str]][source]

Retrieves dictionaries mapping country codes to names and vice versa.

Returns:

A tuple containing:
  • code_name: Dict mapping country codes to uppercase country names

  • name_code: Dict mapping uppercase country names to codes

Return type:

Tuple[Dict[str, str], Dict[str, str]]

Raises:

ImportError – If django-countries package is not installed.
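
Examples

A sketch assuming django-countries is installed; the specific entries shown are illustrative:

>>> code_name, name_code = get_django_countries_dict()
>>> code_name["SG"]
'SINGAPORE'
>>> name_code["SINGAPORE"]
'SG'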

data_manipulation.django_.init_django(django_dir: str, project_name: str | None = None) None[source]

Initializes Django environment for external Python scripts or notebooks.

Parameters:
  • django_dir (str) – Path to Django project directory.

  • project_name (Optional[str]) – Name of the Django project. If None, will check DJANGO_PROJECT env var.

Raises:
  • ValueError – If project_name is not provided and DJANGO_PROJECT environment variable is not set.

  • ImportError – If Django cannot be imported or setup fails.
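
Examples

A hedged sketch; the project path and name are hypothetical:

>>> init_django("/path/to/django/project", "myproject")
>>> from django.conf import settings  # Django settings are now configured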

data_manipulation.flask_ module

data_manipulation.geopandas_ module

data_manipulation.geopandas_.haversine_np(longitude1: NumericType, latitude1: NumericType, longitude2: NumericType, latitude2: NumericType) NumericType[source]

Calculates the great circle distance between two points on Earth.

Parameters:
  • longitude1 (float | ndarray) – Longitude of the first point(s) in decimal degrees.

  • latitude1 (float | ndarray) – Latitude of the first point(s) in decimal degrees.

  • longitude2 (float | ndarray) – Longitude of the second point(s) in decimal degrees.

  • latitude2 (float | ndarray) – Latitude of the second point(s) in decimal degrees.

Returns:

Distance between the points in kilometers.

Return type:

float | ndarray

Raises:

ValueError – If latitude values are outside [-90, 90] or longitude values outside [-180, 180].

Examples

>>> haversine_np(-0.127758, 51.507351, 103.819836, 1.352083)  # London to Singapore
10880.39...
>>> haversine_np(np.array([-0.127758]), np.array([51.507351]),
...             np.array([103.819836]), np.array([1.352083]))
array([10880.39...])

Note

Uses the Haversine formula to calculate great circle distances. Earth radius is assumed to be 6371 km. Supports both scalar and numpy array inputs for vectorized calculations.

data_manipulation.kerberos_ module

data_manipulation.kerberos_.keytab_valid(keytab_filepath: str, principal_name: str) CompletedProcess[str][source]

Validates a Kerberos keytab file using kinit.

Parameters:
  • keytab_filepath (str) – Path to the Kerberos keytab file.

  • principal_name (str) – Kerberos principal name associated with the keytab.

Returns:

Result of the kinit command execution containing:
  • returncode: 0 if successful, non-zero if failed

  • stdout: Standard output from the command

  • stderr: Standard error from the command

Return type:

subprocess.CompletedProcess[str]

Raises:
  • ValueError – If keytab file doesn’t exist or inputs are invalid

  • FileNotFoundError – If kinit command is not available

  • subprocess.SubprocessError – If command execution fails

Examples

>>> result = keytab_valid("/path/to/keytab", "user@REALM.COM")
>>> result.returncode == 0  # True if keytab is valid
True

data_manipulation.mysql_connector_python_ module

exception data_manipulation.mysql_connector_python_.DatabaseError[source]

Bases: Exception

Custom exception for database-related errors.

data_manipulation.mysql_connector_python_.create_connection_pool(host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5) Any[source]

Creates a connection pool for MySQL database.

Parameters:
  • host (str) – Database host address.

  • dbname (str) – Database name to connect to.

  • user (str) – Username for authentication.

  • password (str) – Password for authentication.

  • port (int, optional) – Database port number. Defaults to 3306.

  • pool_size (int, optional) – Size of the connection pool. Defaults to 5.

Returns:

Database connection pool object.

Return type:

Any

Raises:

DatabaseError – If pool creation fails.
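
Examples

A sketch with placeholder credentials:

>>> pool = create_connection_pool(
...     host="localhost",
...     dbname="mydb",
...     user="admin",
...     password="secret",
... )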

data_manipulation.mysql_connector_python_.execute_query(connection: Any, sql_query: str, data: dict | tuple | None = None, fetch: bool = False) int | None | List[Dict[str, Any]][source]

Executes a MySQL query with parameters.

Parameters:
  • connection (ConnectionType) – Active MySQL connection.

  • sql_query (str) – SQL query to execute.

  • data (Optional[Union[dict, tuple]], optional) – Parameters for the SQL query.

  • fetch (bool, optional) – Whether to fetch results. Defaults to False.

Returns:

  • For INSERT: Last inserted row ID

  • For SELECT: List of dictionaries containing the results

  • None for other operations or on failure

Return type:

Union[Optional[int], List[Dict[str, Any]], None]

Raises:

DatabaseError – If query execution fails.
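
Examples

A hedged sketch; the table and columns are hypothetical, and conn is an open connection (see get_connection() below):

>>> rows = execute_query(
...     conn,
...     "SELECT id, name FROM users WHERE id = %s",
...     data=(1,),
...     fetch=True,
... )
>>> new_id = execute_query(
...     conn,
...     "INSERT INTO users (name) VALUES (%s)",
...     data=("alice",),
... )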

data_manipulation.mysql_connector_python_.get_connection(host: str, dbname: str, user: str, password: str, port: int = 3306, use_pool: bool = False, pool: Any | None = None) Generator[Any, None, None][source]

Context manager for database connections.

Parameters:
  • host (str) – Database host address.

  • dbname (str) – Database name to connect to.

  • user (str) – Username for authentication.

  • password (str) – Password for authentication.

  • port (int, optional) – Database port number. Defaults to 3306.

  • use_pool (bool, optional) – Whether to use connection pooling. Defaults to False.

  • pool (Optional[Any], optional) – Existing connection pool.

Yields:

ConnectionType – Database connection object.

Raises:

DatabaseError – If connection fails.
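
Examples

A sketch combining the helpers in this module; credentials are placeholders:

>>> with get_connection("localhost", "mydb", "admin", "secret") as conn:
...     rows = execute_query(conn, "SELECT 1", fetch=True)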

data_manipulation.openldap_ module

data_manipulation.openldap_.ldapsearch(search_base: str, ldap_uri: str, bind_dn: str, password: str, search_filter: str) CompletedProcess[source]

Executes an LDAP search query using the ldapsearch command.

Parameters:
  • search_base (str) – Base DN for the search operation.

  • ldap_uri (str) – LDAP server URI (e.g., “ldap://example.com:389”).

  • bind_dn (str) – Distinguished Name (DN) for binding to the LDAP server.

  • password (str) – Password for authentication.

  • search_filter (str) – LDAP search filter (e.g., “(objectClass=person)”).

Returns:

Result of the ldapsearch command execution.

Return type:

subprocess.CompletedProcess

Raises:
  • ValueError – If any input parameters are empty or invalid.

  • subprocess.SubprocessError – If the ldapsearch command fails to execute.

  • FileNotFoundError – If ldapsearch command is not available.

Examples

>>> result = ldapsearch(
...     "dc=example,dc=com",
...     "ldap://ldap.example.com",
...     "cn=admin,dc=example,dc=com",
...     "password",
...     "(objectClass=person)"
... )
>>> result.returncode == 0  # True if search was successful
True
data_manipulation.openldap_.validate_ldap_uri(uri: str) bool[source]

Validates that a string is a well-formed LDAP URI.

Parameters:

uri (str) – URI to validate (e.g., “ldap://example.com:389”).

Returns:

True if the URI is well-formed, False otherwise.

Return type:

bool
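
Examples

Illustrative calls; the exact validation rules (accepted schemes, host checks) are assumptions:

>>> validate_ldap_uri("ldap://ldap.example.com:389")
True
>>> validate_ldap_uri("not a uri")
False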

data_manipulation.pandas_ module

data_manipulation.pandas_.add_type_columns(dataframe: DataFrame) DataFrame[source]

Adds type information columns for each column in the dataframe.

Parameters:

dataframe (pd.DataFrame) – Input dataframe to analyze.

Returns:

DataFrame with additional type columns. For each original column ‘X’,

adds a column ‘X_type’ containing the Python type of each value.

Return type:

pd.DataFrame

Raises:

TypeError – If input is not a pandas DataFrame.

Examples

>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> add_type_columns(df)
   a        a_type  b        b_type
0  1  <class 'int'>  x  <class 'str'>
1  2  <class 'int'>  y  <class 'str'>
data_manipulation.pandas_.aggregate_set_without_none(column: Series, nested_set: bool = False) set[source]

Creates a set from a series, excluding None values.

Parameters:
  • column (pd.Series) – Series to aggregate into a set.

  • nested_set (bool, optional) – Whether to handle nested sets. Defaults to False.

Returns:

Set containing non-None values from the series. Note that float NaN values are not None and may remain, as the example below shows.

Return type:

set

Examples

>>> s = pd.Series([1, None, 2, None, 3])
>>> aggregate_set_without_none(s)
{1.0, 2.0, 3.0, nan, nan}
data_manipulation.pandas_.chunking_dataframe(dataframe: DataFrame, chunk_size: int) List[DataFrame][source]

Splits a dataframe into smaller chunks of specified size.

Parameters:
  • dataframe (pd.DataFrame) – DataFrame to split.

  • chunk_size (int) – Maximum number of rows in each chunk.

Returns:

List of DataFrames, each containing at most chunk_size rows.

Return type:

List[pd.DataFrame]

Raises:

TypeError – If dataframe is not a pandas DataFrame or chunk_size is not an integer.

Examples

>>> df = pd.DataFrame({'a': range(5)})
>>> chunks = chunking_dataframe(df, 2)
>>> [len(chunk) for chunk in chunks]
[2, 2, 1]
data_manipulation.pandas_.clean_none(dataframe: DataFrame, nan_to_none: bool = True, clean_variation: bool = True, none_variations: List[str] = []) DataFrame[source]

Standardizes None values in a dataframe.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame.

  • nan_to_none (bool, optional) – Convert NaN to None. Defaults to True.

  • clean_variation (bool, optional) – Clean common None variations. Defaults to True.

  • none_variations (List[str], optional) – Additional None variations to clean. Defaults to [].

Returns:

DataFrame with standardized None values.

Return type:

pd.DataFrame

Note

Deprecated as of pandas 1.3.0.

data_manipulation.pandas_.compare_all_list_items(list_: List[Any]) DataFrame[source]

Creates a DataFrame comparing all possible pairs of items in a list.

Parameters:

list_ (List[Any]) – Input list containing items to compare

Returns:

DataFrame with columns:
  • item1: First item in comparison

  • item2: Second item in comparison

  • item1_eq_item2: Boolean indicating if items are equal

Return type:

pd.DataFrame

Raises:

TypeError – If input is not a list

Examples

>>> compare_all_list_items([1, 1, 2, 3])
  item1 item2  item1_eq_item2
0     1     1            True
1     1     2           False
2     1     3           False
3     1     2           False
4     1     3           False
5     2     3           False
data_manipulation.pandas_.compare_dataframes(dataframe1: DataFrame, dataframe2: DataFrame) None[source]

Compares two dataframes and prints detailed comparison information.

Parameters:
  • dataframe1 (pd.DataFrame) – First DataFrame to compare.

  • dataframe2 (pd.DataFrame) – Second DataFrame to compare.

Raises:

TypeError – If either input is not a pandas DataFrame.
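
Examples

A minimal sketch; the printed comparison depends on the inputs:

>>> df1 = pd.DataFrame({'a': [1, 2]})
>>> df2 = pd.DataFrame({'a': [1, None]})
>>> compare_dataframes(df1, df2)  # prints row counts and column-wise differences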

Note

Prints comparison results including:
  • Row counts

  • Column-wise comparison of non-null values

  • Detailed analysis of differences when counts don’t match

data_manipulation.pandas_.config_pandas_display() None[source]

Configures pandas display settings for better output readability.

Sets the following pandas display options:
  • display.max_columns: 500

  • display.max_colwidth: 500

  • display.expand_frame_repr: True
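
Examples

The effect can be checked through pandas options:

>>> config_pandas_display()
>>> pd.get_option("display.max_columns")
500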

Note

Modifies global pandas settings.

data_manipulation.pandas_.dtypes_dictionary(dataframe: DataFrame) Dict[type, List[str]][source]

Creates a dictionary mapping Python types to column names in a DataFrame.

Parameters:

dataframe (pd.DataFrame) – Input DataFrame to analyze

Returns:

Dictionary where:
  • keys are Python types (e.g., str, int, float)

  • values are lists of column names with that type

Return type:

Dict[type, List[str]]

Raises:
  • TypeError – If input is not a DataFrame

  • ValueError – If DataFrame is empty

Examples

>>> df = pd.DataFrame({
...     'int_': [0],
...     'str_': ['0'],
...     'list_': [[]],
...     'dict_': [{}],
...     'none_': [None]
... })
>>> dtypes_dictionary(df)
{
    <class 'int'>: ['int_'],
    <class 'str'>: ['str_'],
    <class 'list'>: ['list_'],
    <class 'dict'>: ['dict_'],
    <class 'NoneType'>: ['none_']
}
data_manipulation.pandas_.head_tail(dataframe: DataFrame, n: int = 5) DataFrame[source]

Returns first and last n rows of a dataframe.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame.

  • n (int, optional) – Number of rows from top and bottom. Defaults to 5.

Returns:

Concatenated first and last n rows.

Return type:

pd.DataFrame

Examples

>>> df = pd.DataFrame({'a': range(10)})
>>> len(head_tail(df, 2))
4
data_manipulation.pandas_.index_marks(n_rows: int, chunk_size: int) range[source]

Calculates indices for splitting data into chunks of specified size.

Parameters:
  • n_rows (int) – Total number of rows in the dataset

  • chunk_size (int) – Desired size of each chunk

Returns:

Range object containing indices where chunks should be split

Return type:

range

Raises:

TypeError – If either n_rows or chunk_size is not an integer

Examples

>>> index_marks(10, 3)
range(3, 12, 3)
>>> list(index_marks(10, 3))
[3, 6, 9]

Note

Used internally by chunking_dataframe() to determine split points for breaking large datasets into smaller chunks.

data_manipulation.pandas_.is_running_in_jupyter() bool[source]

Checks if code is running in a Jupyter notebook environment.

Returns:

True if running in Jupyter notebook, False otherwise.

Return type:

bool

Examples

>>> is_running_in_jupyter()
True  # When running in Jupyter
False  # When running in regular Python
data_manipulation.pandas_.print_dataframe_overview(dataframe: DataFrame, stats: bool = False) None[source]

Prints comprehensive overview of DataFrame contents and statistics.

Displays for each column:
  • Number of unique elements

  • Number of null elements

  • Value counts with percentages

  • Optional statistics (min/max) if stats=True

Parameters:
  • dataframe (pd.DataFrame) – DataFrame to analyze

  • stats (bool, optional) – Whether to include min/max statistics. Defaults to False.

Raises:

TypeError – If input is not a pandas DataFrame

Examples

>>> data = {"int_": [1, 1, 2], "str_": ["a", "b", "b"]}
>>> df = pd.DataFrame(data)
>>> print_dataframe_overview(df)
==============================
int_
==============================
Unique elements: 2
Null elements: 0
    count    percent
int_
1         2  66.666667
2         1  33.333333
==============================
str_
==============================
Unique elements: 2
Null elements: 0
    count    percent
str_
b         2  66.666667
a         1  33.333333
data_manipulation.pandas_.ps_aux_dataframe(ps_aux_commands: str) DataFrame[source]

Converts Linux ps aux command output to a DataFrame.

Parameters:

ps_aux_commands (str) – ps aux command string to execute.

Returns:

DataFrame containing process information with columns matching

ps aux output format.

Return type:

pd.DataFrame

Examples

>>> df = ps_aux_dataframe("ps aux | head")
>>> 'PID' in df.columns
True
data_manipulation.pandas_.series_count(series: Series) DataFrame[source]

Calculates value counts and percentages for a Series.

Enhanced version of series.value_counts() that includes percentage calculations.

Parameters:

series (pd.Series) – Input series to analyze

Returns:

DataFrame with columns:
  • count: Frequency of each unique value

  • percent: Percentage of total (0-100) for each value

Return type:

pd.DataFrame

Raises:

TypeError – If input is not a pandas Series

Examples

>>> values = [1, 1, 1, 2, 2, 3]
>>> s = pd.Series(values)
>>> series_count(s)
   count    percent
1      3  50.000000
2      2  33.333333
3      1  16.666667
data_manipulation.pandas_.series_to_columns(dataframe: DataFrame, column: str) DataFrame[source]

Expands a column containing dictionaries into separate columns.

Parameters:
  • dataframe (pd.DataFrame) – Input dataframe.

  • column (str) – Name of column containing dictionaries to expand.

Returns:

DataFrame with dictionary column expanded into separate columns.

Return type:

pd.DataFrame

Examples

>>> values = [[1, {'a':1, 'b':2}]]
>>> cols = ["c1", "c2"]
>>> df = pd.DataFrame(values, columns=cols)
>>> df
c1                c2
0   1  {'a': 1, 'b': 2}
>>> series_to_columns(df, "c2")
c1  a  b
0   1  1  2
data_manipulation.pandas_.split_dataframe(dataframe: DataFrame, uuid: str, columns: List[str]) Tuple[DataFrame, DataFrame][source]

Splits a dataframe into two based on specified columns while maintaining a linking ID.

Parameters:
  • dataframe (pd.DataFrame) – Input DataFrame to split.

  • uuid (str) – Name of the column containing unique identifiers.

  • columns (List[str]) – List of column names to split into second DataFrame.

Returns:

Tuple containing:
  • First DataFrame with specified columns removed

  • Second DataFrame with only specified columns and uuid

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Raises:

TypeError – If inputs are not of correct types.

Examples

>>> df = pd.DataFrame({'id': [1], 'a': [2], 'b': [3]})
>>> df1, df2 = split_dataframe(df, 'id', ['b'])
>>> df1
   id  a
0   1  2
>>> df2
   b  id
0  3   1
data_manipulation.pandas_.split_left_merged_dataframe(dataframe1: DataFrame, dataframe2: DataFrame, columns: List[str]) Tuple[DataFrame, DataFrame][source]

Splits result of a left merge into matched and unmatched records.

Parameters:
  • dataframe1 (pd.DataFrame) – Left DataFrame for merge.

  • dataframe2 (pd.DataFrame) – Right DataFrame for merge.

  • columns (List[str]) – Columns to merge on.

Returns:

Tuple containing:
  • DataFrame with rows that matched in both inputs

  • DataFrame with rows only present in left input

Return type:

Tuple[pd.DataFrame, pd.DataFrame]

Examples

>>> df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo', "zen"], 'value': [1, 2, 3, 5, 9]})
>>> df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
>>> df_both, df_left = split_left_merged_dataframe(df1, df2, ["key"])
                count    percent
_merge
both            6  85.714286
left_only       1  14.285714
right_only      0   0.000000
>>> df_both
   key  value  value_y _merge
0  foo      1      5.0   both
1  foo      1      8.0   both
2  bar      2      6.0   both
3  baz      3      7.0   both
4  foo      5      5.0   both
5  foo      5      8.0   both
>>> df_left
   key  value  value_y     _merge
6  zen      9      NaN  left_only
data_manipulation.pandas_.to_excel_keep_url(filepath: str, dataframe: DataFrame) None[source]

Exports DataFrame to Excel while preserving URLs as clickable links.

Handles Excel’s automatic URL conversion and limitations:
  • Preserves URLs without auto-conversion

  • Handles Excel’s limit of 65,530 URLs per worksheet

  • Manages URLs longer than 255 characters

Parameters:
  • filepath (str) – Output Excel file path (including .xlsx extension)

  • dataframe (pd.DataFrame) – DataFrame to export

Raises:
  • TypeError – If filepath is not a string or dataframe is not a DataFrame

  • IOError – If file cannot be written to specified location

Examples

>>> df = pd.DataFrame({
...     'url': ['https://www.example.com'],
...     'data': ['test']
... })
>>> to_excel_keep_url('output.xlsx', df)
Excel exported ...

Note

Uses xlsxwriter engine with specific options to prevent URL auto-conversion. Make sure the target directory exists before calling this function.

data_manipulation.pandas_.useless_columns(dataframe: DataFrame) Tuple[List[str], List[str]][source]

Identifies empty and single-value columns in a DataFrame.

Parameters:

dataframe (pd.DataFrame) – DataFrame to analyze

Returns:

Two lists containing:
  • List of column names that are completely empty

  • List of column names that contain only a single unique value

Return type:

Tuple[List[str], List[str]]

Raises:

TypeError – If input is not a pandas DataFrame

Examples

>>> df = pd.DataFrame({
...     'empty': [None, None],
...     'single': ['A', 'A'],
...     'normal': ['A', 'B']
... })
>>> empty_cols, single_cols = useless_columns(df)
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['empty']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['single']

Note

Empty columns are those where all values are None/NaN. Single-value columns may be candidates for removal or conversion to constants.

data_manipulation.prometheus_ module

data_manipulation.postgres_ module

data_manipulation.pyspark_ module

data_manipulation.pyspark_.add_dummy_columns(dataframe: DataFrame, columns: List[str], value: str) DataFrame[source]

Adds new columns with default values to a Spark DataFrame.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • columns (List[str]) – List of column names to add

  • value (str) – Default value for the new columns

Returns:

DataFrame with added columns

Return type:

pyspark.sql.DataFrame

Raises:

TypeError – If arguments are not of correct type:
  • dataframe must be a Spark DataFrame

  • columns must be a list

  • value must be a string

Examples

>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = add_dummy_columns(df, ["age", "city"], "unknown")
>>> new_df.show()
+-----+---+-------+-------+
| name| id|    age|   city|
+-----+---+-------+-------+
|Alice|  1|unknown|unknown|
+-----+---+-------+-------+
data_manipulation.pyspark_.column_into_list(dataframe: DataFrame, column: str) List[Any][source]

Extracts values from a DataFrame column into a Python list.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • column (str) – Name of the column to extract

Returns:

List containing all values from the specified column,

including duplicates

Return type:

List[Any]

Raises:

TypeError – If dataframe is not a Spark DataFrame or column is not a string

Examples

>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_list(df, "value")
[1, 2, 2]
data_manipulation.pyspark_.column_into_set(dataframe: DataFrame, column: str) Set[Any][source]

Extracts unique values from a DataFrame column into a Python set.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • column (str) – Name of the column to extract

Returns:

Set containing unique values from the specified column

Return type:

Set[Any]

Raises:

TypeError – If dataframe is not a Spark DataFrame or column is not a string

Examples

>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_set(df, "value")
{1, 2}
data_manipulation.pyspark_.columns_prefix(dataframe: DataFrame, prefix: str) DataFrame[source]

Adds a prefix to all column names in a DataFrame.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • prefix (str) – Prefix to add to column names

Returns:

DataFrame with renamed columns

Return type:

pyspark.sql.DataFrame

Raises:

TypeError – If dataframe is not a Spark DataFrame or prefix is not a string

Examples

>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = columns_prefix(df, "user_")
>>> new_df.show()
+---------+-------+
|user_name|user_id|
+---------+-------+
|    Alice|      1|
+---------+-------+
data_manipulation.pyspark_.columns_statistics(dataframe: DataFrame, n: int = 10) Tuple[List[str], List[str]][source]

Analyzes column statistics and identifies empty and single-value columns.

Performs comprehensive analysis of each column including:
  • Value counts

  • Empty value detection

  • Single value detection

  • Basic statistics

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • n (int, optional) – Number of top values to display for each column. Defaults to 10.

Returns:

Two lists containing:
  • List of empty column names

  • List of single-value column names

Return type:

Tuple[List[str], List[str]]

Raises:

TypeError – If dataframe is not a Spark DataFrame

Examples

>>> df = spark.createDataFrame([
...     ("Alice", None),
...     ("Alice", None)
... ], ["name", "email"])
>>> empty_cols, single_cols = columns_statistics(df)
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['email']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['name']
data_manipulation.pyspark_.config_spark_local(autoset: bool = True) None[source]

Configures Spark for local execution with optimized settings based on system resources.

Automatically calculates and sets optimal Spark configuration parameters based on:
  • Available CPU cores

  • System memory

  • Executor allocation

  • Memory distribution

Parameters:

autoset (bool, optional) – Whether to automatically apply the configuration. If False, only prints recommended settings. Defaults to True.

Examples

>>> config_spark_local()
Here is the current computer specs ...
executor_per_node: 1
spark_executor_instances: 1
total_executor_memory: 30
spark_executor_memory: 27
memory_overhead: 3
spark_default_parallelism: 10
spark.sql.execution.arrow.pyspark.enabled recommended by Koalas ...
spark auto-configured ...

Note

Configuration includes:
  • Executor cores and memory

  • Driver cores and memory

  • Memory overhead

  • Default parallelism

  • Shuffle partitions

  • Arrow optimization for PySpark

data_manipulation.pyspark_.describe(dataframe: DataFrame) None[source]

Prints comprehensive information about a DataFrame.

Displays:
  • DataFrame type

  • Number of columns

  • Number of rows

  • Schema information

Parameters:

dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

Raises:

TypeError – If dataframe is not a Spark DataFrame

Examples

>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> describe(df)
The dataframe: <class 'pyspark.sql.dataframe.DataFrame'>
Number of columns: 2
Number of rows: 1
root
 |-- name: string (nullable = true)
 |-- id: long (nullable = true)
data_manipulation.pyspark_.group_count(dataframe: DataFrame, columns: str | List[str], n: int | float = 10) DataFrame[source]

Performs group by operation and calculates count and percentage for each group.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • columns (Union[str, List[str]]) – Column(s) to group by

  • n (Union[int, float], optional) – Number of top groups to return. Use float(‘inf’) for all groups. Defaults to 10.

Returns:

DataFrame with columns:
  • Group by column(s)

  • count: Count of records in each group

  • percent: Percentage of total records in each group

Return type:

pyspark.sql.DataFrame

Raises:

TypeError – If arguments are not of correct type

Examples

>>> df = spark.createDataFrame([
...     (1, 'A'), (1, 'B'), (2, 'A')
... ], ["id", "category"])
>>> group_count(df, ["id"]).show()
+---+-----+-------+
| id|count|percent|
+---+-----+-------+
|  1|    2|   66.7|
|  2|    1|   33.3|
+---+-----+-------+
data_manipulation.pyspark_.rename(dataframe: DataFrame, columns: Dict[str, str]) DataFrame[source]

Renames multiple columns in a DataFrame using a mapping dictionary.

Parameters:
  • dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame

  • columns (Dict[str, str]) – Dictionary mapping old column names to new names

Returns:

DataFrame with renamed columns

Return type:

pyspark.sql.DataFrame

Raises:

TypeError – If dataframe is not a Spark DataFrame or columns is not a dict

Examples

>>> df = spark.createDataFrame([("Alice", 1)], ["_1", "_2"])
>>> new_df = rename(df, {"_1": "name", "_2": "id"})
>>> new_df.show()
+-----+---+
| name| id|
+-----+---+
|Alice|  1|
+-----+---+

data_manipulation.smtplib_ module

data_manipulation.smtplib_.send_email(logname: str, message_subject: str, message_sender: str, message_receiver: str, html: str, smtp_address: str, smtp_port: int = 587, smtp_username: str | None = None, smtp_password: str | None = None) bool[source]

Sends an HTML email using SMTP with TLS and logs the operation.

Parameters:
  • logname (str) – Path to the log file

  • message_subject (str) – Subject line of the email

  • message_sender (str) – Email address of the sender

  • message_receiver (str) – Email address of the recipient

  • html (str) – HTML content of the email body

  • smtp_address (str) – SMTP server address (e.g., ‘smtp.gmail.com’)

  • smtp_port (int, optional) – SMTP server port. Defaults to 587.

  • smtp_username (str, optional) – SMTP authentication username

  • smtp_password (str, optional) – SMTP authentication password

Returns:

True if email was sent successfully, False otherwise

Return type:

bool

Raises:
  • ValueError – If required parameters are empty or invalid

  • SMTPException – If email sending fails

  • IOError – If log file cannot be accessed
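
Examples

A sketch with placeholder addresses and an assumed SMTP relay:

>>> sent = send_email(
...     logname="email.log",
...     message_subject="Report ready",
...     message_sender="noreply@example.com",
...     message_receiver="user@example.com",
...     html="<p>The report is ready.</p>",
...     smtp_address="smtp.example.com",
...     smtp_username="noreply@example.com",
...     smtp_password="secret",
... )
>>> isinstance(sent, bool)
True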

data_manipulation.sqlalchemy_ module

exception data_manipulation.sqlalchemy_.DatabaseConnectionError[source]

Bases: Exception

Custom exception for database connection errors.

data_manipulation.sqlalchemy_.create_sqlalchemy_engine(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5, max_overflow: int = 10, pool_timeout: int = 30, connect_timeout: int = 10, ssl_ca: str | None = None) Engine[source]

Creates and tests a SQLAlchemy engine for database operations.

Parameters:
  • drivername (str) – Database driver name. Supported options include ‘mysql+mysqlconnector’, ‘mysql+pymysql’, ‘postgresql+psycopg2’, ‘mssql+pyodbc’, ‘oracle+cx_oracle’, and ‘sqlite3’.

  • host (str) – Database server hostname or IP address

  • dbname (str) – Name of the target database

  • user (str) – Database username for authentication

  • password (str) – Database password for authentication

  • port (int, optional) – Database server port number. Defaults to 3306.

  • pool_size (int, optional) – The size of the connection pool. Defaults to 5.

  • max_overflow (int, optional) – Maximum number of connections above pool_size. Defaults to 10.

  • pool_timeout (int, optional) – Timeout for getting a connection from pool. Defaults to 30.

  • connect_timeout (int, optional) – Timeout for database connections. Defaults to 10.

  • ssl_ca (Optional[str], optional) – Path to SSL CA certificate. Defaults to None.

Returns:

Configured database engine object

Return type:

sqlalchemy.engine.base.Engine

Raises:

DatabaseConnectionError – If engine creation or connection test fails

Examples

>>> engine = create_sqlalchemy_engine(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
# Logs "create_sqlalchemy_engine: True" on success
# or "create_sqlalchemy_engine: False (error_message)" on failure

Note

The function automatically tests the connection upon creation and logs the result using loguru. A successful connection will be logged as info, while failures will be logged as errors with the specific exception message.

data_manipulation.sqlalchemy_.create_sqlalchemy_url(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, query: Dict[str, Any] | None = None) URL[source]

Creates a SQLAlchemy URL object for database connection.

Parameters:
  • drivername (str) – Database driver name. Supported options include ‘mysql+mysqlconnector’, ‘mysql+pymysql’, ‘postgresql+psycopg2’, ‘mssql+pyodbc’, ‘oracle+cx_oracle’, and ‘sqlite3’.

  • host (str) – Database server hostname or IP address

  • dbname (str) – Name of the target database

  • user (str) – Database username for authentication

  • password (str) – Database password for authentication

  • port (int, optional) – Database server port number. Defaults to 3306.

  • query (Optional[Dict[str, Any]], optional) – Additional connection parameters. Useful for SSL configuration. Defaults to None.

Returns:

Configured URL object for database connection

Return type:

sqlalchemy.engine.url.URL

Examples

>>> url = create_sqlalchemy_url(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
>>> str(url)
'postgresql+psycopg2://admin:secret@localhost:5432/mydb'
data_manipulation.sqlalchemy_.dispose_engine(engine: Engine) None[source]

Safely dispose of the SQLAlchemy engine and its connection pool.

Parameters:

engine (Engine) – The SQLAlchemy engine to dispose
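
Examples

Usage sketch, paired with create_sqlalchemy_engine() above; credentials are placeholders:

>>> engine = create_sqlalchemy_engine(
...     drivername='mysql+pymysql',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
... )
>>> dispose_engine(engine)  # closes pooled connections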

Module contents