data_manipulation package¶
Submodules¶
data_manipulation.base module¶
- data_manipulation.base.clean_string(string: str, remove_parenthesis: bool = False, remove_brackets: bool = False) str [source]¶
Cleans and standardizes input string.
- Parameters:
string (str) – String to clean.
remove_parenthesis (bool) – Whether to remove content within parentheses.
remove_brackets (bool) – Whether to remove content within square brackets.
- Returns:
Uppercase cleaned string with standardized spacing.
- Return type:
str
- Raises:
ValueError – If input string is None or empty.
Examples
>>> clean_string("  sHawn  tesT ")
'SHAWN TEST'
>>> clean_string("shawn ( te st )")
'SHAWN ( TE ST )'
>>> clean_string("shawn ( te st )", remove_parenthesis=True)
'SHAWN'
>>> clean_string("shawn [ te st ]", remove_brackets=True)
'SHAWN'
- data_manipulation.base.create_encode_url(url: str, query_params: Dict[str, Any] | None = None) str [source]¶
Creates an encoded URL with query parameters.
- Parameters:
url (str) – Base URL.
query_params (dict, optional) – Dictionary of URL query parameters. Defaults to None.
- Returns:
Encoded URL with query parameters.
- Return type:
str
- Raises:
ValueError – If URL is invalid.
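Examples
A minimal sketch; the query string and the exact encoding of spaces ('+' vs '%20') are assumed here, not taken from the source:
>>> create_encode_url("https://example.com/search", {"q": "data manipulation"})
'https://example.com/search?q=data+manipulation'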
- data_manipulation.base.delete_list_indices(list_: list, indices: List[int]) None [source]¶
Deletes multiple indices from a list in-place.
- Parameters:
list_ (list) – Original list to modify.
indices (list) – List of indices to delete.
Examples
>>> values = [0, 1, 2, 3, 4]
>>> delete_list_indices(values, [1, 3])
>>> values
[0, 2, 4]
- data_manipulation.base.get_country_code_variation() dict [source]¶
Returns a dictionary mapping country codes to their name variations.
- Returns:
Dictionary with country codes as keys and lists of country name variations as values.
- Return type:
dict
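Examples
A hedged sketch of the expected shape; the key and its variation list are hypothetical, actual entries come from the package data:
>>> variations = get_country_code_variation()
>>> variations["SG"]
['SINGAPORE', 'REPUBLIC OF SINGAPORE']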
- data_manipulation.base.get_country_name_variation() dict [source]¶
Returns a dictionary mapping country names to their codes.
- Returns:
Dictionary with country names as keys and country codes as values.
- Return type:
dict
- data_manipulation.base.get_none_variation() List[None | str] [source]¶
Returns a list of common variations of None/null values.
- Returns:
List containing None and various string representations of null values.
- Return type:
list
Examples
>>> get_none_variation()
[None, 'NONE', 'NONe', 'NOnE', 'NOne', 'NoNE', 'NoNe', 'NonE', 'None', 'nONE', 'nONe', 'nOnE', 'nOne', 'noNE', 'noNe', 'nonE', 'none', 'NULL', 'NULl', 'NUlL', 'NUll', 'NuLL', 'NuLl', 'NulL', 'Null', 'nULL', 'nULl', 'nUlL', 'nUll', 'nuLL', 'nuLl', 'nulL', 'null', 'NA', 'Na', 'nA', 'na', 'N.A', 'N.a', 'N.A', 'N.a', 'n.A', 'n.a', 'n.A', 'n.a', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'N.A.', 'N.A.', 'N.a.', 'N.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'n.A.', 'n.A.', 'n.a.', 'n.a.', 'NAN', 'NAn', 'NaN', 'Nan', 'nAN', 'nAn', 'naN', 'nan', 'NIL', 'NIl', 'NiL', 'Nil', 'nIL', 'nIl', 'niL', 'nil']
- data_manipulation.base.get_path_files(path: str | Path, keywords: List[str]) List[str] [source]¶
Returns sorted list of files from given path that contain specified keywords.
- Parameters:
path (str) – Directory path to search.
keywords (list) – List of keywords to match in filenames.
- Returns:
Sorted list of matching filenames.
- Return type:
List[str]
Examples
>>> get_path_files("test_base_folder", ["py"])
['test1.py', 'test2.py', 'test3.py', 'test4.py', 'test5.py']
- data_manipulation.base.get_string_case_combination(str_: str) List[str] [source]¶
Generates all possible case combinations of a string.
- Parameters:
str_ (str) – Input string to generate combinations for.
- Returns:
List of all possible case combinations.
- Return type:
List[str]
Examples
>>> get_string_case_combination("abc")
['ABC', 'ABc', 'AbC', 'Abc', 'aBC', 'aBc', 'abC', 'abc']
- data_manipulation.base.list_to_file(filepath: str | Path, list_: List, newline: bool = True) None [source]¶
Writes list contents to a file.
- Parameters:
filepath (str) – Path to output file.
list_ (list) – List of values to write.
newline (bool, optional) – Whether to add newline after each item. Defaults to True.
Examples
>>> list_to_file("test.txt", [1, 2, 3])
- data_manipulation.base.list_tuple_without_none(list_tuple: List | Tuple) List | Tuple [source]¶
Removes None variations from a list or tuple.
- Parameters:
list_tuple – Input list or tuple to clean.
- Returns:
Cleaned list or tuple.
- Return type:
Union[List, Tuple]
- Raises:
TypeError – If input is not a list or tuple.
Examples
>>> list_tuple_without_none(["a", "none"])
['a']
>>> list_tuple_without_none(("a", "none"))
('a',)
- data_manipulation.base.parse_ps_aux(ps_aux_commands: str) List[List[str]] [source]¶
Parses Linux ps aux command output into a list of records.
- Parameters:
ps_aux_commands (str) – Linux ps aux command string.
- Returns:
List of process records.
- Return type:
List[List[str]]
Examples
>>> # parse_ps_aux("ps aux | egrep -i '%cpu|anaconda3' | head")
- data_manipulation.base.remove_path_file(path: str | Path, keyword: str, n: int = 2) None [source]¶
Removes all but the n newest files matching the keyword.
- Parameters:
path (str) – Directory path.
keyword (str) – Keyword to match in filenames.
n (int, optional) – Number of newest files to keep. Defaults to 2.
Examples
>>> remove_path_file("test_base_folder", ".py")
- data_manipulation.base.string_boolean_to_int(boolean_str_rep: str) int [source]¶
Converts string boolean representations to integers.
- Parameters:
boolean_str_rep (str) – String representation of boolean value.
- Returns:
1 for true values, 0 for false values.
- Return type:
int
Examples
>>> string_boolean_to_int("true")
1
>>> string_boolean_to_int("True")
1
- data_manipulation.base.string_dlt_to_dlt(dlt_str_rep: str) Dict | List | Tuple [source]¶
Converts string representation of data structures to actual Python objects.
- Parameters:
dlt_str_rep (str) – String representation of dictionary/list/tuple.
- Returns:
Converted Python data structure.
- Return type:
Union[Dict, List, Tuple]
Examples
>>> string_dlt_to_dlt("[1, 2, 3]")
[1, 2, 3]
>>> string_dlt_to_dlt("{'a': 1, 'b': 2}")
{'a': 1, 'b': 2}
>>> string_dlt_to_dlt("('1', '2', '3')")
('1', '2', '3')
- data_manipulation.base.string_str_to_str(string_str_rep: str) str [source]¶
Converts string representation to a clean string by removing quotes.
- Parameters:
string_str_rep (str) – String representation to clean.
- Returns:
Cleaned string with outer quotes removed.
- Return type:
str
Examples
>>> string_str_to_str("'test'")
'test'
>>> string_str_to_str('"test"')
'test'
data_manipulation.beautifulsoup_ module¶
- data_manipulation.beautifulsoup_.build_soup(url: str, features: str = 'lxml', to_preprocess: bool = True, timeout: int = 10, headers: dict | None = None) BeautifulSoup | None [source]¶
Creates a BeautifulSoup object from a given URL.
- Parameters:
url (str) – URL to fetch and parse.
features (str, optional) – Parser to use. Defaults to “lxml”.
to_preprocess (bool, optional) – Whether to preprocess the HTML. Defaults to True.
timeout (int, optional) – Request timeout in seconds. Defaults to 10.
headers (Optional[dict], optional) – Custom headers for the request. Defaults to None.
- Returns:
Parsed BeautifulSoup object, or None if request fails.
- Return type:
Optional[BeautifulSoup]
Examples
>>> a = build_soup("https://google.com")
>>> type(a)
<class 'bs4.BeautifulSoup'>
Note
Requires requests and beautifulsoup4 packages.
- data_manipulation.beautifulsoup_.preprocess(html: str) str | None [source]¶
Removes redundant whitespace and newline characters from an HTML string.
- Parameters:
html (str) – HTML string to be cleaned.
- Returns:
Cleaned HTML string with normalized whitespace.
- Return type:
Optional[str]
Examples
>>> a = "<html> <p> Something </p> </html> "
>>> preprocess(a)
'<html><p>Something</p></html>'
Note
Reference: https://stackoverflow.com/questions/23241641
data_manipulation.boto3_ module¶
- data_manipulation.boto3_.list_s3_bucket_files(bucket: str, to_dateframe: bool = False, prefix: str | None = None) List[str] | pd.DataFrame [source]¶
Lists all files in an S3 bucket.
- Parameters:
bucket (str) – Name of the S3 bucket.
to_dateframe (bool, optional) – Whether to return results as pandas DataFrame. Defaults to False.
prefix (Optional[str], optional) – Filter results to files with this prefix. Defaults to None.
- Returns:
- List of file keys or DataFrame containing file keys.
If to_dateframe is True, returns DataFrame with ‘key’ column. If to_dateframe is False, returns list of file keys.
- Return type:
Union[List[str], pd.DataFrame]
Examples
>>> files = list_s3_bucket_files('my-bucket')
>>> type(files)
<class 'list'>
>>> df = list_s3_bucket_files('my-bucket', to_dateframe=True)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
- data_manipulation.boto3_.send_aws_ses_email(sender: str, recipient: List[str], subject: str, body_text: str, body_type: str, ses_client: Any, attachment: str | None = None, max_attachment_size: int = 10485760, max_retries: int = 3) Dict[str, Any] | None [source]¶
Sends an email using AWS SES service.
- Parameters:
sender (str) – Sender’s email address.
recipient (list) – List of recipient email addresses.
subject (str) – Email subject line.
body_text (str) – Email body content.
body_type (str) – MIME type of email body (e.g., ‘plain’, ‘html’).
ses_client – AWS SES client instance.
attachment (str, optional) – Path to file to attach. Defaults to None.
max_attachment_size (int, optional) – Maximum allowed attachment size in bytes. Defaults to 10MB.
max_retries (int, optional) – Maximum number of retry attempts for failed sends. Defaults to 3.
- Returns:
AWS SES response dictionary if successful, None if failed.
- Return type:
dict
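Examples
A hypothetical sketch assuming a boto3 SES client; the addresses, region, and attachment path are illustrative:
>>> import boto3
>>> ses_client = boto3.client("ses", region_name="us-east-1")
>>> response = send_aws_ses_email(
...     sender="noreply@example.com",
...     recipient=["user@example.com"],
...     subject="Monthly report",
...     body_text="<p>Report attached.</p>",
...     body_type="html",
...     ses_client=ses_client,
...     attachment="report.pdf",
... )
>>> response is not None  # SES response dict on success, None on failure
True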
data_manipulation.cryptography_ module¶
- data_manipulation.cryptography_.decrypt_fernet_data(keypath: str | Path, filepath: str | Path) bytes [source]¶
Decrypts a file using Fernet symmetric encryption.
- Parameters:
keypath (str) – Path to the Fernet key file.
filepath (str) – Path to the encrypted file.
- Returns:
Decrypted data.
- Return type:
bytes
- Raises:
FileNotFoundError – If key file or input file doesn’t exist
InvalidToken – If the key is invalid or data is corrupted
TypeError – If input types are incorrect
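Examples
A minimal sketch assuming 'key.txt' holds a Fernet key and 'data.txt' was encrypted with it via encrypt_fernet_file(); filenames are illustrative:
>>> decrypted = decrypt_fernet_data('key.txt', 'data.txt')
>>> isinstance(decrypted, bytes)
True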
- data_manipulation.cryptography_.encrypt_fernet_file(keypath: str | Path, filepath: str | Path) bytes [source]¶
Encrypts a file using Fernet symmetric encryption.
- Parameters:
keypath (str) – Path to the Fernet key file.
filepath (str) – Path to the file to be encrypted.
- Returns:
Encrypted data.
- Return type:
bytes
- Raises:
TypeError – If keypath or filepath are not strings.
Examples
>>> encrypted = encrypt_fernet_file('key.txt', 'data.txt')
>>> isinstance(encrypted, bytes)
True
- data_manipulation.cryptography_.generate_fernet_key(output_directory: str | Path, output_filename: str) bytes [source]¶
Generates and saves a Fernet encryption key.
- Parameters:
output_directory (str) – Directory path where the key file will be saved.
output_filename (str) – Name of the key file to be created.
- Returns:
Generated Fernet key in bytes format.
- Return type:
bytes
- Raises:
Exception – If key generation or file writing fails.
Examples
>>> key = generate_fernet_key('/path/to/keys', 'encryption.key')
>>> isinstance(key, bytes)
True
data_manipulation.django_ module¶
- data_manipulation.django_.django_validate_email(email: str, whitelist_domains: List[str] | None = None) str | None [source]¶
Validates an email address using Django’s validator.
- Parameters:
email (str) – Email address to validate.
whitelist_domains (Optional[List[str]]) – List of allowed email domains.
- Returns:
The validated email address if valid, None if invalid.
- Return type:
Optional[str]
Examples
>>> django_validate_email("valid@email.com")
'valid@email.com'
>>> django_validate_email("valid@email.com", whitelist_domains=["email.com"])
'valid@email.com'
>>> print(django_validate_email("valid@email.com", whitelist_domains=["other.com"]))
None
- data_manipulation.django_.django_validate_phone(phone: str, region: str | None = None) str | None [source]¶
Validates and formats a phone number using Django’s phone number field.
- Parameters:
phone (str) – Phone number to validate.
region (Optional[str], optional) – Region code for parsing local numbers. Defaults to None.
- Returns:
The phone number in E.164 format if valid, None if invalid.
- Return type:
Optional[str]
Examples
>>> django_validate_phone("+1234567890")
'+1234567890'
>>> print(django_validate_phone("invalid"))
None
Note
Requires django-phonenumber-field package to be installed.
- data_manipulation.django_.django_validate_url(url: str, allowed_schemes: List[str] | None = None) str | None [source]¶
Validates a URL using Django’s URL validator.
- Parameters:
url (str) – URL to validate.
allowed_schemes (Optional[List[str]]) – List of allowed URL schemes (e.g., [‘http’, ‘https’]).
- Returns:
The validated URL if valid, None if invalid.
- Return type:
Optional[str]
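Examples
A minimal sketch following the return contract described above; the URLs are illustrative:
>>> django_validate_url("https://example.com")
'https://example.com'
>>> print(django_validate_url("ftp://example.com", allowed_schemes=["http", "https"]))
None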
- data_manipulation.django_.get_django_countries_dict() Tuple[Dict[str, str], Dict[str, str]] [source]¶
Retrieves dictionaries mapping country codes to names and vice versa.
- Returns:
- A tuple containing:
code_name: Dict mapping country codes to uppercase country names
name_code: Dict mapping uppercase country names to codes
- Return type:
Tuple[Dict[str, str], Dict[str, str]]
- Raises:
ImportError – If django-countries package is not installed.
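Examples
A hedged sketch of the expected shape; actual entries come from the django-countries package:
>>> code_name, name_code = get_django_countries_dict()
>>> code_name["SG"]
'SINGAPORE'
>>> name_code["SINGAPORE"]
'SG'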
- data_manipulation.django_.init_django(django_dir: str, project_name: str | None = None) None [source]¶
Initializes Django environment for external Python scripts or notebooks.
- Parameters:
django_dir (str) – Path to Django project directory.
project_name (Optional[str]) – Name of the Django project. If None, will check DJANGO_PROJECT env var.
- Raises:
ValueError – If project_name is not provided and DJANGO_PROJECT environment variable is not set.
ImportError – If Django cannot be imported or setup fails.
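Examples
A hypothetical setup for a notebook or standalone script; the directory, project, and app names are illustrative:
>>> init_django("/path/to/django_project", "myproject")
>>> from myapp.models import MyModel  # ORM is now usable outside manage.py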
data_manipulation.flask_ module¶
data_manipulation.geopandas_ module¶
- data_manipulation.geopandas_.haversine_np(longitude1: NumericType, latitude1: NumericType, longitude2: NumericType, latitude2: NumericType) NumericType [source]¶
Calculates the great circle distance between two points on Earth.
- Parameters:
longitude1 (float | ndarray) – Longitude of the first point(s) in decimal degrees.
latitude1 (float | ndarray) – Latitude of the first point(s) in decimal degrees.
longitude2 (float | ndarray) – Longitude of the second point(s) in decimal degrees.
latitude2 (float | ndarray) – Latitude of the second point(s) in decimal degrees.
- Returns:
Distance between the points in kilometers.
- Return type:
float | ndarray
- Raises:
ValueError – If latitude values are outside [-90, 90] or longitude values outside [-180, 180].
Examples
>>> haversine_np(-0.127758, 51.507351, 103.819836, 1.352083)  # London to Singapore
10880.39...
>>> haversine_np(np.array([-0.127758]), np.array([51.507351]),
...              np.array([103.819836]), np.array([1.352083]))
array([10880.39...])
Note
Uses the Haversine formula to calculate great circle distances. Earth radius is assumed to be 6371 km. Supports both scalar and numpy array inputs for vectorized calculations.
data_manipulation.kerberos_ module¶
- data_manipulation.kerberos_.keytab_valid(keytab_filepath: str, principal_name: str) CompletedProcess[str] [source]¶
Validates a Kerberos keytab file using kinit.
- Parameters:
keytab_filepath (str) – Path to the Kerberos keytab file.
principal_name (str) – Kerberos principal name associated with the keytab.
- Returns:
- Result of the kinit command execution containing:
returncode: 0 if successful, non-zero if failed
stdout: Standard output from the command
stderr: Standard error from the command
- Return type:
subprocess.CompletedProcess[str]
- Raises:
ValueError – If keytab file doesn’t exist or inputs are invalid
FileNotFoundError – If kinit command is not available
subprocess.SubprocessError – If command execution fails
Examples
>>> result = keytab_valid("/path/to/keytab", "user@REALM.COM")
>>> result.returncode == 0  # True if keytab is valid
True
data_manipulation.mysql_connector_python_ module¶
- exception data_manipulation.mysql_connector_python_.DatabaseError[source]¶
Bases:
Exception
Custom exception for database-related errors.
- data_manipulation.mysql_connector_python_.create_connection_pool(host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5) Any [source]¶
Creates a connection pool for MySQL database.
- Parameters:
host (str) – Database host address.
dbname (str) – Database name to connect to.
user (str) – Username for authentication.
password (str) – Password for authentication.
port (int, optional) – Database port number. Defaults to 3306.
pool_size (int, optional) – Size of the connection pool. Defaults to 5.
- Returns:
Database connection pool object.
- Return type:
Any
- Raises:
DatabaseError – If pool creation fails.
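Examples
A minimal sketch with placeholder credentials:
>>> pool = create_connection_pool(
...     host="localhost",
...     dbname="mydb",
...     user="admin",
...     password="secret",
...     pool_size=5,
... )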
- data_manipulation.mysql_connector_python_.execute_query(connection: Any, sql_query: str, data: dict | tuple | None = None, fetch: bool = False) int | None | List[Dict[str, Any]] [source]¶
Executes a MySQL query with parameters.
- Parameters:
connection (ConnectionType) – Active MySQL connection.
sql_query (str) – SQL query to execute.
data (Optional[Union[dict, tuple]], optional) – Parameters for the SQL query.
fetch (bool, optional) – Whether to fetch results. Defaults to False.
- Returns:
For INSERT: Last inserted row ID
For SELECT: List of dictionaries containing the results
None for other operations or on failure
- Return type:
Union[int, List[Dict[str, Any]], None]
- Raises:
DatabaseError – If query execution fails.
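Examples
A hedged sketch assuming an open connection (e.g. from get_connection()); the table and column names are hypothetical:
>>> rows = execute_query(
...     conn,
...     "SELECT id, name FROM users WHERE id = %(id)s",
...     data={"id": 1},
...     fetch=True,
... )  # e.g. [{'id': 1, 'name': 'Alice'}]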
- data_manipulation.mysql_connector_python_.get_connection(host: str, dbname: str, user: str, password: str, port: int = 3306, use_pool: bool = False, pool: Any | None = None) Generator[Any, None, None] [source]¶
Context manager for database connections.
- Parameters:
host (str) – Database host address.
dbname (str) – Database name to connect to.
user (str) – Username for authentication.
password (str) – Password for authentication.
port (int, optional) – Database port number. Defaults to 3306.
use_pool (bool, optional) – Whether to use connection pooling. Defaults to False.
pool (Optional[Any], optional) – Existing connection pool.
- Yields:
ConnectionType – Database connection object.
- Raises:
DatabaseError – If connection fails.
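Examples
A minimal sketch with placeholder credentials; pairs naturally with execute_query():
>>> with get_connection("localhost", "mydb", "admin", "secret") as conn:
...     rows = execute_query(conn, "SELECT 1 AS one", fetch=True)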
data_manipulation.openldap_ module¶
- data_manipulation.openldap_.ldapsearch(search_base: str, ldap_uri: str, bind_dn: str, password: str, search_filter: str) CompletedProcess [source]¶
Executes an LDAP search query using the ldapsearch command.
- Parameters:
search_base (str) – Base DN for the search operation.
ldap_uri (str) – LDAP server URI (e.g., “ldap://example.com:389”).
bind_dn (str) – Distinguished Name (DN) for binding to the LDAP server.
password (str) – Password for authentication.
search_filter (str) – LDAP search filter (e.g., “(objectClass=person)”).
- Returns:
Result of the ldapsearch command execution.
- Return type:
subprocess.CompletedProcess
- Raises:
ValueError – If any input parameters are empty or invalid.
subprocess.SubprocessError – If the ldapsearch command fails to execute.
FileNotFoundError – If ldapsearch command is not available.
Examples
>>> result = ldapsearch(
...     "dc=example,dc=com",
...     "ldap://ldap.example.com",
...     "cn=admin,dc=example,dc=com",
...     "password",
...     "(objectClass=person)"
... )
>>> result.returncode == 0  # True if search was successful
True
data_manipulation.pandas_ module¶
- data_manipulation.pandas_.add_type_columns(dataframe: DataFrame) DataFrame [source]¶
Adds type information columns for each column in the dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input dataframe to analyze.
- Returns:
- DataFrame with additional type columns. For each original column ‘X’,
adds a column ‘X_type’ containing the Python type of each value.
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a pandas DataFrame.
Examples
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> add_type_columns(df)
   a         a_type  b         b_type
0  1  <class 'int'>  x  <class 'str'>
1  2  <class 'int'>  y  <class 'str'>
- data_manipulation.pandas_.aggregate_set_without_none(column: Series, nested_set: bool = False) set [source]¶
Creates a set from a series, excluding None values.
- Parameters:
column (pd.Series) – Series to aggregate into a set.
nested_set (bool, optional) – Whether to handle nested sets. Defaults to False.
- Returns:
Set containing non-None values from the series.
- Return type:
set
Examples
>>> s = pd.Series([1, None, 2, None, 3])
>>> aggregate_set_without_none(s)
{1.0, 2.0, 3.0, nan, nan}
- data_manipulation.pandas_.chunking_dataframe(dataframe: DataFrame, chunk_size: int) List[DataFrame] [source]¶
Splits a dataframe into smaller chunks of specified size.
- Parameters:
dataframe (pd.DataFrame) – DataFrame to split.
chunk_size (int) – Maximum number of rows in each chunk.
- Returns:
List of DataFrames, each containing at most chunk_size rows.
- Return type:
List[pd.DataFrame]
- Raises:
TypeError – If dataframe is not a pandas DataFrame or chunk_size is not an integer.
Examples
>>> df = pd.DataFrame({'a': range(5)})
>>> chunks = chunking_dataframe(df, 2)
>>> [len(chunk) for chunk in chunks]
[2, 2, 1]
- data_manipulation.pandas_.clean_none(dataframe: DataFrame, nan_to_none: bool = True, clean_variation: bool = True, none_variations: List[str] = []) DataFrame [source]¶
Standardizes None values in a dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
nan_to_none (bool, optional) – Convert NaN to None. Defaults to True.
clean_variation (bool, optional) – Clean common None variations. Defaults to True.
none_variations (List[str], optional) – Additional None variations to clean. Defaults to [].
- Returns:
DataFrame with standardized None values.
- Return type:
pd.DataFrame
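Examples
A hedged sketch; the exact behavior depends on the pandas version, per the Note below:
>>> df = pd.DataFrame({"a": [1, "none", "NULL"]})
>>> clean_none(df)["a"].tolist()
[1, None, None]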
Note
Deprecated as of pandas 1.3.0.
- data_manipulation.pandas_.compare_all_list_items(list_: List[Any]) DataFrame [source]¶
Creates a DataFrame comparing all possible pairs of items in a list.
- Parameters:
list_ (List[Any]) – Input list containing items to compare
- Returns:
- DataFrame with columns:
item1: First item in comparison
item2: Second item in comparison
item1_eq_item2: Boolean indicating if items are equal
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a list
Examples
>>> compare_all_list_items([1, 1, 2, 3])
  item1 item2  item1_eq_item2
0     1     1            True
1     1     2           False
2     1     3           False
3     1     2           False
4     1     3           False
5     2     3           False
- data_manipulation.pandas_.compare_dataframes(dataframe1: DataFrame, dataframe2: DataFrame) None [source]¶
Compares two dataframes and prints detailed comparison information.
- Parameters:
dataframe1 (pd.DataFrame) – First DataFrame to compare.
dataframe2 (pd.DataFrame) – Second DataFrame to compare.
- Raises:
TypeError – If either input is not a pandas DataFrame.
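Examples
A minimal sketch; the function prints its report rather than returning a value:
>>> df1 = pd.DataFrame({'a': [1, 2]})
>>> df2 = pd.DataFrame({'a': [1, 2, 3]})
>>> compare_dataframes(df1, df2)  # prints row counts and column-wise differences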
Note
Prints comparison results including:
- Row counts
- Column-wise comparison of non-null values
- Detailed analysis of differences when counts don’t match
- data_manipulation.pandas_.config_pandas_display() None [source]¶
Configures pandas display settings for better output readability.
- Sets the following pandas display options:
display.max_columns: 500
display.max_colwidth: 500
display.expand_frame_repr: True
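Examples
A minimal sketch; the option values follow the list above:
>>> config_pandas_display()
>>> import pandas as pd
>>> pd.get_option("display.max_columns")
500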
Note
Modifies global pandas settings.
- data_manipulation.pandas_.dtypes_dictionary(dataframe: DataFrame) Dict[type, List[str]] [source]¶
Creates a dictionary mapping Python types to column names in a DataFrame.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame to analyze
- Returns:
- Dictionary where:
keys are Python types (e.g., str, int, float)
values are lists of column names with that type
- Return type:
Dict[type, List[str]]
- Raises:
TypeError – If input is not a DataFrame
ValueError – If DataFrame is empty
Examples
>>> df = pd.DataFrame({
...     'int_': [0],
...     'str_': ['0'],
...     'list_': [[]],
...     'dict_': [{}],
...     'none_': [None]
... })
>>> dtypes_dictionary(df)
{<class 'int'>: ['int_'], <class 'str'>: ['str_'], <class 'list'>: ['list_'], <class 'dict'>: ['dict_'], <class 'NoneType'>: ['none_']}
- data_manipulation.pandas_.head_tail(dataframe: DataFrame, n: int = 5) DataFrame [source]¶
Returns first and last n rows of a dataframe.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame.
n (int, optional) – Number of rows from top and bottom. Defaults to 5.
- Returns:
Concatenated first and last n rows.
- Return type:
pd.DataFrame
Examples
>>> df = pd.DataFrame({'a': range(10)})
>>> len(head_tail(df, 2))
4
- data_manipulation.pandas_.index_marks(n_rows: int, chunk_size: int) range [source]¶
Calculates indices for splitting data into chunks of specified size.
- Parameters:
n_rows (int) – Total number of rows in the dataset
chunk_size (int) – Desired size of each chunk
- Returns:
Range object containing indices where chunks should be split
- Return type:
range
- Raises:
TypeError – If either n_rows or chunk_size is not an integer
Examples
>>> index_marks(10, 3)
range(3, 12, 3)
>>> list(index_marks(10, 3))
[3, 6, 9]
Note
Used internally by chunking_dataframe() to determine split points for breaking large datasets into smaller chunks.
- data_manipulation.pandas_.is_running_in_jupyter() bool [source]¶
Checks if code is running in a Jupyter notebook environment.
- Returns:
True if running in Jupyter notebook, False otherwise.
- Return type:
bool
Examples
>>> is_running_in_jupyter()
True   # When running in Jupyter
False  # When running in regular Python
- data_manipulation.pandas_.print_dataframe_overview(dataframe: DataFrame, stats: bool = False) None [source]¶
Prints comprehensive overview of DataFrame contents and statistics.
Displays for each column:
- Number of unique elements
- Number of null elements
- Value counts with percentages
- Optional statistics (min/max) if stats=True
- Parameters:
dataframe (pd.DataFrame) – DataFrame to analyze
stats (bool, optional) – Whether to include min/max statistics. Defaults to False.
- Raises:
TypeError – If input is not a pandas DataFrame
Examples
>>> data = {"int_": [1, 1, 2], "str_": ["a", "b", "b"]}
>>> df = pd.DataFrame(data)
>>> print_dataframe_overview(df)
==============================
int_
==============================
Unique elements: 2
Null elements: 0
      count    percent
int_
1         2  66.666667
2         1  33.333333
==============================
str_
==============================
Unique elements: 2
Null elements: 0
      count    percent
str_
b         2  66.666667
a         1  33.333333
- data_manipulation.pandas_.ps_aux_dataframe(ps_aux_commands: str) DataFrame [source]¶
Converts Linux ps aux command output to a DataFrame.
- Parameters:
ps_aux_commands (str) – ps aux command string to execute.
- Returns:
- DataFrame containing process information with columns matching
ps aux output format.
- Return type:
pd.DataFrame
Examples
>>> df = ps_aux_dataframe("ps aux | head")
>>> 'PID' in df.columns
True
- data_manipulation.pandas_.series_count(series: Series) DataFrame [source]¶
Calculates value counts and percentages for a Series.
Enhanced version of series.value_counts() that includes percentage calculations.
- Parameters:
series (pd.Series) – Input series to analyze
- Returns:
- DataFrame with columns:
count: Frequency of each unique value
percent: Percentage of total (0-100) for each value
- Return type:
pd.DataFrame
- Raises:
TypeError – If input is not a pandas Series
Examples
>>> values = [1, 1, 1, 2, 2, 3]
>>> s = pd.Series(values)
>>> series_count(s)
   count    percent
1      3  50.000000
2      2  33.333333
3      1  16.666667
- data_manipulation.pandas_.series_to_columns(dataframe: DataFrame, column: str) DataFrame [source]¶
Expands a column containing dictionaries into separate columns.
- Parameters:
dataframe (pd.DataFrame) – Input dataframe.
column (str) – Name of column containing dictionaries to expand.
- Returns:
DataFrame with dictionary column expanded into separate columns.
- Return type:
pd.DataFrame
Examples
>>> values = [[1, {'a':1, 'b':2}]]
>>> cols = ["c1", "c2"]
>>> df = pd.DataFrame(values, columns=cols)
>>> df
   c1                c2
0   1  {'a': 1, 'b': 2}
>>> series_to_columns(df, "c2")
   c1  a  b
0   1  1  2
- data_manipulation.pandas_.split_dataframe(dataframe: DataFrame, uuid: str, columns: List[str]) Tuple[DataFrame, DataFrame] [source]¶
Splits a dataframe into two based on specified columns while maintaining a linking ID.
- Parameters:
dataframe (pd.DataFrame) – Input DataFrame to split.
uuid (str) – Name of the column containing unique identifiers.
columns (List[str]) – List of column names to split into second DataFrame.
- Returns:
- Tuple containing:
First DataFrame with specified columns removed
Second DataFrame with only specified columns and uuid
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
- Raises:
TypeError – If inputs are not of correct types.
Examples
>>> df = pd.DataFrame({'id': [1], 'a': [2], 'b': [3]})
>>> df1, df2 = split_dataframe(df, 'id', ['b'])
>>> df1
   id  a
0   1  2
>>> df2
   b  id
0  3   1
- data_manipulation.pandas_.split_left_merged_dataframe(dataframe1: DataFrame, dataframe2: DataFrame, columns: List[str]) Tuple[DataFrame, DataFrame] [source]¶
Splits result of a left merge into matched and unmatched records.
- Parameters:
dataframe1 (pd.DataFrame) – Left DataFrame for merge.
dataframe2 (pd.DataFrame) – Right DataFrame for merge.
columns (List[str]) – Columns to merge on.
- Returns:
- Tuple containing:
DataFrame with rows that matched in both inputs
DataFrame with rows only present in left input
- Return type:
Tuple[pd.DataFrame, pd.DataFrame]
Examples
>>> df1 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo', "zen"], 'value': [1, 2, 3, 5, 9]})
>>> df2 = pd.DataFrame({'key': ['foo', 'bar', 'baz', 'foo'], 'value': [5, 6, 7, 8]})
>>> df_both, df_left = split_left_merged_dataframe(df1, df2, ["key"])
            count    percent
_merge
both            6  85.714286
left_only       1  14.285714
right_only      0   0.000000
>>> df_both
   key  value  value_y _merge
0  foo      1      5.0   both
1  foo      1      8.0   both
2  bar      2      6.0   both
3  baz      3      7.0   both
4  foo      5      5.0   both
5  foo      5      8.0   both
>>> df_left
   key  value  value_y     _merge
6  zen      9      NaN  left_only
- data_manipulation.pandas_.to_excel_keep_url(filepath: str, dataframe: DataFrame) None [source]¶
Exports DataFrame to Excel while preserving URLs as clickable links.
Handles Excel’s automatic URL conversion and limitations:
- Preserves URLs without auto-conversion
- Handles Excel’s limit of 65,530 URLs per worksheet
- Manages URLs longer than 255 characters
- Parameters:
filepath (str) – Output Excel file path (including .xlsx extension)
dataframe (pd.DataFrame) – DataFrame to export
- Raises:
TypeError – If filepath is not a string or dataframe is not a DataFrame
IOError – If file cannot be written to specified location
Examples
>>> df = pd.DataFrame({
...     'url': ['https://www.example.com'],
...     'data': ['test']
... })
>>> to_excel_keep_url('output.xlsx', df)
Excel exported ...
Note
Uses xlsxwriter engine with specific options to prevent URL auto-conversion. Make sure the target directory exists before calling this function.
- data_manipulation.pandas_.useless_columns(dataframe: DataFrame) Tuple[List[str], List[str]] [source]¶
Identifies empty and single-value columns in a DataFrame.
- Parameters:
dataframe (pd.DataFrame) – DataFrame to analyze
- Returns:
- Two lists containing:
List of column names that are completely empty
List of column names that contain only a single unique value
- Return type:
Tuple[List[str], List[str]]
- Raises:
TypeError – If input is not a pandas DataFrame
Examples
>>> df = pd.DataFrame({
...     'empty': [None, None],
...     'single': ['A', 'A'],
...     'normal': ['A', 'B']
... })
>>> empty_cols, single_cols = useless_columns(df)
Empty columns: ['empty']
Single value columns: ['single']
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['empty']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['single']
Note
Empty columns are those where all values are None/NaN. Single-value columns may be candidates for removal or conversion to constants.
data_manipulation.prometheus_ module¶
data_manipulation.postgres_ module¶
data_manipulation.pyspark_ module¶
- data_manipulation.pyspark_.add_dummy_columns(dataframe: DataFrame, columns: List[str], value: str) DataFrame [source]¶
Adds new columns with default values to a Spark DataFrame.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (List[str]) – List of column names to add
value (str) – Default value for the new columns
- Returns:
DataFrame with added columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If arguments are not of correct type:
- dataframe must be a Spark DataFrame
- columns must be a list
- value must be a string
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = add_dummy_columns(df, ["age", "city"], "unknown")
>>> new_df.show()
+-----+---+-------+-------+
| name| id|    age|   city|
+-----+---+-------+-------+
|Alice|  1|unknown|unknown|
+-----+---+-------+-------+
- data_manipulation.pyspark_.column_into_list(dataframe: DataFrame, column: str) List[Any] [source]¶
Extracts values from a DataFrame column into a Python list.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
column (str) – Name of the column to extract
- Returns:
- List containing all values from the specified column,
including duplicates
- Return type:
List[Any]
- Raises:
TypeError – If dataframe is not a Spark DataFrame or column is not a string
Examples
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_list(df, "value")
[1, 2, 2]
- data_manipulation.pyspark_.column_into_set(dataframe: DataFrame, column: str) Set[Any] [source]¶
Extracts unique values from a DataFrame column into a Python set.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
column (str) – Name of the column to extract
- Returns:
Set containing unique values from the specified column
- Return type:
Set[Any]
- Raises:
TypeError – If dataframe is not a Spark DataFrame or column is not a string
Examples
>>> df = spark.createDataFrame([(1,), (2,), (2,)], ["value"])
>>> column_into_set(df, "value")
{1, 2}
- data_manipulation.pyspark_.columns_prefix(dataframe: DataFrame, prefix: str) DataFrame [source]¶
Adds a prefix to all column names in a DataFrame.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
prefix (str) – Prefix to add to column names
- Returns:
DataFrame with renamed columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame or prefix is not a string
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> new_df = columns_prefix(df, "user_")
>>> new_df.show()
+---------+-------+
|user_name|user_id|
+---------+-------+
|    Alice|      1|
+---------+-------+
- data_manipulation.pyspark_.columns_statistics(dataframe: DataFrame, n: int = 10) Tuple[List[str], List[str]] [source]¶
Analyzes column statistics and identifies empty and single-value columns.
Performs comprehensive analysis of each column including:
- Value counts
- Empty value detection
- Single value detection
- Basic statistics
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
n (int, optional) – Number of top values to display for each column. Defaults to 10.
- Returns:
- Two lists containing:
List of empty column names
List of single-value column names
- Return type:
Tuple[List[str], List[str]]
- Raises:
TypeError – If dataframe is not a Spark DataFrame
Examples
>>> df = spark.createDataFrame([
...     ("Alice", None),
...     ("Alice", None)
... ], ["name", "email"])
>>> empty_cols, single_cols = columns_statistics(df)
>>> print(f"Empty columns: {empty_cols}")
Empty columns: ['email']
>>> print(f"Single value columns: {single_cols}")
Single value columns: ['name']
- data_manipulation.pyspark_.config_spark_local(autoset: bool = True) None [source]¶
Configures Spark for local execution with optimized settings based on system resources.
Automatically calculates and sets optimal Spark configuration parameters based on:
- Available CPU cores
- System memory
- Executor allocation
- Memory distribution
- Parameters:
autoset (bool, optional) – Whether to automatically apply the configuration. If False, only prints recommended settings. Defaults to True.
Examples
>>> config_spark_local()
Here is the current computer specs ...
executor_per_node: 1
spark_executor_instances: 1
total_executor_memory: 30
spark_executor_memory: 27
memory_overhead: 3
spark_default_parallelism: 10
spark.sql.execution.arrow.pyspark.enabled recommended by Koalas ...
spark auto-configured ...
Note
Configuration includes:
- Executor cores and memory
- Driver cores and memory
- Memory overhead
- Default parallelism
- Shuffle partitions
- Arrow optimization for PySpark
- data_manipulation.pyspark_.describe(dataframe: DataFrame) None [source]¶
Prints comprehensive information about a DataFrame.
Displays:
- DataFrame type
- Number of columns
- Number of rows
- Schema information
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["name", "id"])
>>> describe(df)
The dataframe: <class 'pyspark.sql.dataframe.DataFrame'>
Number of columns: 2
Number of rows: 1
root
 |-- name: string (nullable = true)
 |-- id: long (nullable = true)
- data_manipulation.pyspark_.group_count(dataframe: DataFrame, columns: str | List[str], n: int | float = 10) DataFrame [source]¶
Performs group by operation and calculates count and percentage for each group.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (Union[str, List[str]]) – Column(s) to group by
n (Union[int, float], optional) – Number of top groups to return. Use float(‘inf’) for all groups. Defaults to 10.
- Returns:
- DataFrame with columns:
Group by column(s)
count: Count of records in each group
percent: Percentage of total records in each group
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If arguments are not of correct type
Examples
>>> df = spark.createDataFrame([
...     (1, 'A'), (1, 'B'), (2, 'A')
... ], ["id", "category"])
>>> group_count(df, ["id"]).show()
+---+-----+-------+
| id|count|percent|
+---+-----+-------+
|  1|    2|   66.7|
|  2|    1|   33.3|
+---+-----+-------+
- data_manipulation.pyspark_.rename(dataframe: DataFrame, columns: Dict[str, str]) DataFrame [source]¶
Renames multiple columns in a DataFrame using a mapping dictionary.
- Parameters:
dataframe (pyspark.sql.DataFrame) – Input Spark DataFrame
columns (Dict[str, str]) – Dictionary mapping old column names to new names
- Returns:
DataFrame with renamed columns
- Return type:
pyspark.sql.DataFrame
- Raises:
TypeError – If dataframe is not a Spark DataFrame or columns is not a dict
Examples
>>> df = spark.createDataFrame([("Alice", 1)], ["_1", "_2"])
>>> new_df = rename(df, {"_1": "name", "_2": "id"})
>>> new_df.show()
+-----+---+
| name| id|
+-----+---+
|Alice|  1|
+-----+---+
data_manipulation.smtplib_ module¶
- data_manipulation.smtplib_.send_email(logname: str, message_subject: str, message_sender: str, message_receiver: str, html: str, smtp_address: str, smtp_port: int = 587, smtp_username: str | None = None, smtp_password: str | None = None) bool [source]¶
Sends an HTML email using SMTP with TLS and logs the operation.
- Parameters:
logname (str) – Path to the log file
message_subject (str) – Subject line of the email
message_sender (str) – Email address of the sender
message_receiver (str) – Email address of the recipient
html (str) – HTML content of the email body
smtp_address (str) – SMTP server address (e.g., ‘smtp.gmail.com’)
smtp_port (int, optional) – SMTP server port. Defaults to 587.
smtp_username (str, optional) – SMTP authentication username
smtp_password (str, optional) – SMTP authentication password
- Returns:
True if email was sent successfully, False otherwise
- Return type:
bool
- Raises:
ValueError – If required parameters are empty or invalid
SMTPException – If email sending fails
IOError – If log file cannot be accessed
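Examples
A hypothetical sketch; server details, credentials, and addresses are illustrative:
>>> sent = send_email(
...     logname="email.log",
...     message_subject="Weekly report",
...     message_sender="bot@example.com",
...     message_receiver="team@example.com",
...     html="<p>All good.</p>",
...     smtp_address="smtp.example.com",
...     smtp_username="bot@example.com",
...     smtp_password="app-password",
... )
>>> sent
True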
data_manipulation.sqlalchemy_ module¶
- exception data_manipulation.sqlalchemy_.DatabaseConnectionError[source]¶
Bases:
Exception
Custom exception for database connection errors.
- data_manipulation.sqlalchemy_.create_sqlalchemy_engine(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, pool_size: int = 5, max_overflow: int = 10, pool_timeout: int = 30, connect_timeout: int = 10, ssl_ca: str | None = None) Engine [source]¶
Creates and tests a SQLAlchemy engine for database operations.
- Parameters:
drivername (str) – Database driver name. Supported options include:
- ‘mysql+mysqlconnector’
- ‘mysql+pymysql’
- ‘postgresql+psycopg2’
- ‘mssql+pyodbc’
- ‘oracle+cx_oracle’
- ‘sqlite3’
host (str) – Database server hostname or IP address
dbname (str) – Name of the target database
user (str) – Database username for authentication
password (str) – Database password for authentication
port (int, optional) – Database server port number. Defaults to 3306.
pool_size (int, optional) – The size of the connection pool. Defaults to 5.
max_overflow (int, optional) – Maximum number of connections above pool_size. Defaults to 10.
pool_timeout (int, optional) – Timeout for getting a connection from pool. Defaults to 30.
connect_timeout (int, optional) – Timeout for database connections. Defaults to 10.
ssl_ca (Optional[str], optional) – Path to SSL CA certificate. Defaults to None.
- Returns:
Configured database engine object
- Return type:
sqlalchemy.engine.base.Engine
- Raises:
DatabaseConnectionError – If engine creation or connection test fails
Examples
>>> engine = create_sqlalchemy_engine(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
# Logs "create_sqlalchemy_engine: True" on success
# or "create_sqlalchemy_engine: False (error_message)" on failure
Note
The function automatically tests the connection upon creation and logs the result using loguru. A successful connection will be logged as info, while failures will be logged as errors with the specific exception message.
- data_manipulation.sqlalchemy_.create_sqlalchemy_url(drivername: str, host: str, dbname: str, user: str, password: str, port: int = 3306, query: Dict[str, Any] | None = None) URL [source]¶
Creates a SQLAlchemy URL object for database connection.
- Parameters:
drivername (str) – Database driver name. Supported options include:
- ‘mysql+mysqlconnector’
- ‘mysql+pymysql’
- ‘postgresql+psycopg2’
- ‘mssql+pyodbc’
- ‘oracle+cx_oracle’
- ‘sqlite3’
host (str) – Database server hostname or IP address
dbname (str) – Name of the target database
user (str) – Database username for authentication
password (str) – Database password for authentication
port (int, optional) – Database server port number. Defaults to 3306.
query (Optional[Dict[str, Any]], optional) – Additional connection parameters. Useful for SSL configuration. Defaults to None.
- Returns:
Configured URL object for database connection
- Return type:
sqlalchemy.engine.url.URL
Examples
>>> url = create_sqlalchemy_url(
...     drivername='postgresql+psycopg2',
...     host='localhost',
...     dbname='mydb',
...     user='admin',
...     password='secret',
...     port=5432
... )
>>> str(url)
'postgresql+psycopg2://admin:secret@localhost:5432/mydb'