{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "623a6fd2",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'1.33.0'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import polars as pl\n",
    "pl.__version__  # The book is built with Polars 1.20.0; this run used 1.33.0"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1aef94a4",
   "metadata": {},
   "source": [
    "# Polars Data Types and Missing Values\n",
    "\n",
    "This notebook covers the core Polars data structures and data types, including the nested types `Array`, `List`, and `Struct`.\n",
    "It also covers handling missing data (`null`), the special floating-point value `NaN`, and data type conversion."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06781dc5-86cf-49d2-b00a-cec9acf86435",
   "metadata": {},
   "source": [
    "# Polars Series, DataFrame, and LazyFrame"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "66da48de",
   "metadata": {},
   "source": [
    "Polars provides three main data structures:\n",
    "- **Series**: A one-dimensional homogeneous array with a name.\n",
    "- **DataFrame**: A two-dimensional table with named columns of potentially different types.\n",
    "- **LazyFrame**: A representation of a query plan that hasn't been executed yet. It allows Polars to optimize queries before running them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "db705e6f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# A Series is a single named column of values that share one data type\n",
    "sales_series = pl.Series(\"sales\", [150.00, 300.00, 250.00])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "cab68e97",
   "metadata": {},
   "outputs": [],
   "source": [
    "sales_df = pl.DataFrame(\n",
    "    {\n",
    "        \"sales\": sales_series,\n",
    "        \"customer_id\": [24, 25, 26],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "9b4884e5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>sales</th><th>customer_id</th></tr><tr><td>f64</td><td>i64</td></tr></thead><tbody><tr><td>150.0</td><td>24</td></tr><tr><td>300.0</td><td>25</td></tr><tr><td>250.0</td><td>26</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 2)\n",
       "┌───────┬─────────────┐\n",
       "│ sales ┆ customer_id │\n",
       "│ ---   ┆ ---         │\n",
       "│ f64   ┆ i64         │\n",
       "╞═══════╪═════════════╡\n",
       "│ 150.0 ┆ 24          │\n",
       "│ 300.0 ┆ 25          │\n",
       "│ 250.0 ┆ 26          │\n",
       "└───────┴─────────────┘"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sales_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "c090668a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# scan_csv builds a lazy query plan; nothing is computed until collect()\n",
    "lazy_df = pl.scan_csv(\"data/fruit.csv\").with_columns(\n",
    "    is_heavy=pl.col(\"weight\") > 200\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "6f6e6d13",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/svg+xml": [
       "<svg xmlns=\"http://www.w3.org/2000/svg\" xmlns:xlink=\"http://www.w3.org/1999/xlink\" width=\"503pt\" height=\"121pt\" viewBox=\"0.00 0.00 503.00 121.00\">\n",
       "<g id=\"graph0\" class=\"graph\" transform=\"scale(1 1) rotate(0) translate(4 117)\">\n",
       "<title>polars_query</title>\n",
       "<polygon fill=\"white\" stroke=\"none\" points=\"-4,4 -4,-117 498.5,-117 498.5,4 -4,4\"/>\n",
       "<!-- p2 -->\n",
       "<g id=\"node1\" class=\"node\">\n",
       "<title>p2</title>\n",
       "<polygon fill=\"none\" stroke=\"black\" points=\"358.38,-41 136.12,-41 136.12,0 358.38,0 358.38,-41\"/>\n",
       "<text xml:space=\"preserve\" text-anchor=\"middle\" x=\"247.25\" y=\"-23.7\" font-family=\"Monospace\" font-size=\"14.00\">Csv SCAN [data/fruit.csv]</text>\n",
       "<text xml:space=\"preserve\" text-anchor=\"middle\" x=\"247.25\" y=\"-7.2\" font-family=\"Monospace\" font-size=\"14.00\">π */5;</text>\n",
       "</g>\n",
       "<!-- p1 -->\n",
       "<g id=\"node2\" class=\"node\">\n",
       "<title>p1</title>\n",
       "<polygon fill=\"none\" stroke=\"black\" points=\"494.5,-113 0,-113 0,-77 494.5,-77 494.5,-113\"/>\n",
       "<text xml:space=\"preserve\" text-anchor=\"middle\" x=\"247.25\" y=\"-89.95\" font-family=\"Monospace\" font-size=\"14.00\">WITH COLUMNS [[(col(\"weight\")) &gt; (200)].alias(\"is_heavy\")]</text>\n",
       "</g>\n",
       "<!-- p2&#45;&gt;p1 -->\n",
       "<g id=\"edge1\" class=\"edge\">\n",
       "<title>p2-&gt;p1</title>\n",
       "<path fill=\"none\" stroke=\"black\" d=\"M247.25,-41.31C247.25,-48.75 247.25,-57.36 247.25,-65.43\"/>\n",
       "<polygon fill=\"black\" stroke=\"black\" points=\"243.75,-65.17 247.25,-75.17 250.75,-65.17 243.75,-65.17\"/>\n",
       "</g>\n",
       "</g>\n",
       "</svg>"
      ],
      "text/plain": [
       "<IPython.core.display.SVG object>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "lazy_df.show_graph()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "13f71120",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 6)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>name</th><th>weight</th><th>color</th><th>is_round</th><th>origin</th><th>is_heavy</th></tr><tr><td>str</td><td>i64</td><td>str</td><td>bool</td><td>str</td><td>bool</td></tr></thead><tbody><tr><td>&quot;Avocado&quot;</td><td>200</td><td>&quot;green&quot;</td><td>false</td><td>&quot;South America&quot;</td><td>false</td></tr><tr><td>&quot;Banana&quot;</td><td>120</td><td>&quot;yellow&quot;</td><td>false</td><td>&quot;Asia&quot;</td><td>false</td></tr><tr><td>&quot;Blueberry&quot;</td><td>1</td><td>&quot;blue&quot;</td><td>false</td><td>&quot;North America&quot;</td><td>false</td></tr><tr><td>&quot;Cantaloupe&quot;</td><td>2500</td><td>&quot;orange&quot;</td><td>true</td><td>&quot;Africa&quot;</td><td>true</td></tr><tr><td>&quot;Cranberry&quot;</td><td>2</td><td>&quot;red&quot;</td><td>false</td><td>&quot;North America&quot;</td><td>false</td></tr><tr><td>&quot;Elderberry&quot;</td><td>1</td><td>&quot;black&quot;</td><td>false</td><td>&quot;Europe&quot;</td><td>false</td></tr><tr><td>&quot;Orange&quot;</td><td>130</td><td>&quot;orange&quot;</td><td>true</td><td>&quot;Asia&quot;</td><td>false</td></tr><tr><td>&quot;Papaya&quot;</td><td>1000</td><td>&quot;orange&quot;</td><td>false</td><td>&quot;South America&quot;</td><td>true</td></tr><tr><td>&quot;Peach&quot;</td><td>150</td><td>&quot;orange&quot;</td><td>true</td><td>&quot;Asia&quot;</td><td>false</td></tr><tr><td>&quot;Watermelon&quot;</td><td>5000</td><td>&quot;green&quot;</td><td>true</td><td>&quot;Africa&quot;</td><td>true</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 6)\n",
       "┌────────────┬────────┬────────┬──────────┬───────────────┬──────────┐\n",
       "│ name       ┆ weight ┆ color  ┆ is_round ┆ origin        ┆ is_heavy │\n",
       "│ ---        ┆ ---    ┆ ---    ┆ ---      ┆ ---           ┆ ---      │\n",
       "│ str        ┆ i64    ┆ str    ┆ bool     ┆ str           ┆ bool     │\n",
       "╞════════════╪════════╪════════╪══════════╪═══════════════╪══════════╡\n",
       "│ Avocado    ┆ 200    ┆ green  ┆ false    ┆ South America ┆ false    │\n",
       "│ Banana     ┆ 120    ┆ yellow ┆ false    ┆ Asia          ┆ false    │\n",
       "│ Blueberry  ┆ 1      ┆ blue   ┆ false    ┆ North America ┆ false    │\n",
       "│ Cantaloupe ┆ 2500   ┆ orange ┆ true     ┆ Africa        ┆ true     │\n",
       "│ Cranberry  ┆ 2      ┆ red    ┆ false    ┆ North America ┆ false    │\n",
       "│ Elderberry ┆ 1      ┆ black  ┆ false    ┆ Europe        ┆ false    │\n",
       "│ Orange     ┆ 130    ┆ orange ┆ true     ┆ Asia          ┆ false    │\n",
       "│ Papaya     ┆ 1000   ┆ orange ┆ false    ┆ South America ┆ true     │\n",
       "│ Peach      ┆ 150    ┆ orange ┆ true     ┆ Asia          ┆ false    │\n",
       "│ Watermelon ┆ 5000   ┆ green  ┆ true     ┆ Africa        ┆ true     │\n",
       "└────────────┴────────┴────────┴──────────┴───────────────┴──────────┘"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "lazy_df.collect()"
   ]
  },
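  {
   "cell_type": "markdown",
   "id": "b7e1c0a4",
   "metadata": {},
   "source": [
    "The lazy workflow above depends on `data/fruit.csv`. The cell below is a minimal sketch of the same idea on in-memory data, so it runs without that file. The `fruit_lazy` frame and its two rows are made up for illustration; the only Polars calls used are `DataFrame.lazy()`, `LazyFrame.explain()`, and `LazyFrame.collect()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d94f17",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "# Hypothetical in-memory data standing in for data/fruit.csv\n",
    "fruit_lazy = pl.DataFrame(\n",
    "    {\"name\": [\"Avocado\", \"Papaya\"], \"weight\": [200, 1000]}\n",
    ").lazy()\n",
    "\n",
    "# Building the query only records a plan; no data is processed yet\n",
    "heavy_query = fruit_lazy.with_columns(is_heavy=pl.col(\"weight\") > 200)\n",
    "print(heavy_query.explain())  # text form of the optimized plan\n",
    "\n",
    "# collect() executes the plan and returns a regular DataFrame\n",
    "heavy_df = heavy_query.collect()\n",
    "print(heavy_df)"
   ]
  },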
  {
   "cell_type": "markdown",
   "id": "e195a598-2a18-4c01-934e-9cb4c1d5dafe",
   "metadata": {},
   "source": [
    "## Polars Array"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a805c83a",
   "metadata": {},
   "source": [
    "The `Array` data type in Polars represents **fixed-size** lists.\n",
    "Unlike the `List` type, `Array` enforces that every row in the column holds the same number of items.\n",
    "Because the size is known up front, this allows for more memory-efficient storage and execution."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "a229fc21",
   "metadata": {},
   "outputs": [],
   "source": [
    "coordinates = pl.DataFrame(\n",
    "    [\n",
    "        pl.Series(\"point2d\", [[1, 3], [2, 3]]),\n",
    "        pl.Series(\"point3d\", [[1, 3, 4], [4, 5, 6]]),\n",
    "    ],\n",
    "    schema={\n",
    "        \"point2d\": pl.Array(shape=2, inner=pl.Int64),\n",
    "        \"point3d\": pl.Array(shape=3, inner=pl.Int64),\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "b9709965",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (2, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>point2d</th><th>point3d</th></tr><tr><td>array[i64, 2]</td><td>array[i64, 3]</td></tr></thead><tbody><tr><td>[1, 3]</td><td>[1, 3, 4]</td></tr><tr><td>[2, 3]</td><td>[4, 5, 6]</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (2, 2)\n",
       "┌───────────────┬───────────────┐\n",
       "│ point2d       ┆ point3d       │\n",
       "│ ---           ┆ ---           │\n",
       "│ array[i64, 2] ┆ array[i64, 3] │\n",
       "╞═══════════════╪═══════════════╡\n",
       "│ [1, 3]        ┆ [1, 3, 4]     │\n",
       "│ [2, 3]        ┆ [4, 5, 6]     │\n",
       "└───────────────┴───────────────┘"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "coordinates"
   ]
  },
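  {
   "cell_type": "markdown",
   "id": "d5a3e8f1",
   "metadata": {},
   "source": [
    "Fixed-size `Array` columns have their own `arr` namespace. The cell below is a small sketch, assuming a standalone `points` Series (a name introduced here for illustration), that sums each row and converts the column back to a variable-length `List`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6b2f9a0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "# Illustrative Series: every row holds exactly two integers\n",
    "points = pl.Series(\"point2d\", [[1, 3], [2, 3]], dtype=pl.Array(pl.Int64, 2))\n",
    "print(points.dtype)\n",
    "\n",
    "# arr.sum() adds up the items within each fixed-size row\n",
    "row_sums = points.arr.sum()\n",
    "print(row_sums)\n",
    "\n",
    "# arr.to_list() converts the column to the variable-length List type\n",
    "as_list = points.arr.to_list()\n",
    "print(as_list.dtype)"
   ]
  },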
  {
   "cell_type": "markdown",
   "id": "778c496e-205b-414d-b081-369b5c929658",
   "metadata": {},
   "source": [
    "## Polars List"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "27457f96",
   "metadata": {},
   "source": [
    "The `List` data type allows for **variable-length** arrays within a column. \n",
    "This is useful for storing sequences of data, such as daily temperature readings or a list of tags."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "1edf8dea",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (2, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>temperature</th><th>wind_speed</th></tr><tr><td>list[f64]</td><td>list[i64]</td></tr></thead><tbody><tr><td>[72.5, 75.0, 77.3]</td><td>[15, 20]</td></tr><tr><td>[68.0, 70.2]</td><td>[10, 12, … 16]</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (2, 2)\n",
       "┌────────────────────┬────────────────┐\n",
       "│ temperature        ┆ wind_speed     │\n",
       "│ ---                ┆ ---            │\n",
       "│ list[f64]          ┆ list[i64]      │\n",
       "╞════════════════════╪════════════════╡\n",
       "│ [72.5, 75.0, 77.3] ┆ [15, 20]       │\n",
       "│ [68.0, 70.2]       ┆ [10, 12, … 16] │\n",
       "└────────────────────┴────────────────┘"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "weather_readings = pl.DataFrame(\n",
    "    {\n",
    "        \"temperature\": [[72.5, 75.0, 77.3], [68.0, 70.2]],\n",
    "        \"wind_speed\": [[15, 20], [10, 12, 14, 16]],\n",
    "    }\n",
    ")\n",
    "weather_readings"
   ]
  },
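  {
   "cell_type": "markdown",
   "id": "f7c4a1d3",
   "metadata": {},
   "source": [
    "Because each row can hold a different number of items, `List` columns are usually processed through the `list` namespace. The cell below is a brief sketch on a copy of the temperature data (the `readings` name is introduced here for illustration): it counts the readings per row, takes the per-row maximum, and flattens the lists with `explode()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a8d5b2e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "readings = pl.DataFrame({\"temperature\": [[72.5, 75.0, 77.3], [68.0, 70.2]]})\n",
    "\n",
    "summary = readings.with_columns(\n",
    "    n_readings=pl.col(\"temperature\").list.len(),  # items per row\n",
    "    max_temp=pl.col(\"temperature\").list.max(),  # largest item per row\n",
    ")\n",
    "print(summary)\n",
    "\n",
    "# explode() turns every list item into its own row\n",
    "long_format = readings.explode(\"temperature\")\n",
    "print(long_format)"
   ]
  },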
  {
   "cell_type": "markdown",
   "id": "25dbd815-b8bd-402f-bba7-78c35224c395",
   "metadata": {},
   "source": [
    "## Polars Struct"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "724bca57",
   "metadata": {},
   "source": [
    "The `Struct` data type is similar to a dictionary or a row nested within a cell. \n",
    "It contains named fields, each with its own data type. Structs are useful for grouping related data together."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "83a9c5c0",
   "metadata": {},
   "outputs": [],
   "source": [
    "rating_series = pl.Series(\n",
    "    \"rating\",\n",
    "    [\n",
    "        {\"Movies\": \"Cars\", \"Theater\": \"NE\", \"Avg_rating\": 4.5},\n",
    "        {\"Movies\": \"Toy Story\", \"Theater\": \"ME\", \"Avg_rating\": 4.9},\n",
    "    ],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "f89a8e1b",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (2,)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>rating</th></tr><tr><td>struct[3]</td></tr></thead><tbody><tr><td>{&quot;Cars&quot;,&quot;NE&quot;,4.5}</td></tr><tr><td>{&quot;Toy Story&quot;,&quot;ME&quot;,4.9}</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (2,)\n",
       "Series: 'rating' [struct[3]]\n",
       "[\n",
       "\t{\"Cars\",\"NE\",4.5}\n",
       "\t{\"Toy Story\",\"ME\",4.9}\n",
       "]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rating_series"
   ]
  },
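  {
   "cell_type": "markdown",
   "id": "b9e6c3f5",
   "metadata": {},
   "source": [
    "The named fields of a `Struct` can be read back individually or all at once. The cell below is a minimal sketch that rebuilds the same ratings data under the illustrative name `ratings`, then uses `struct.field()` to extract one field and `struct.unnest()` to expand every field into its own column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0f7d4a6",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "ratings = pl.Series(\n",
    "    \"rating\",\n",
    "    [\n",
    "        {\"Movies\": \"Cars\", \"Theater\": \"NE\", \"Avg_rating\": 4.5},\n",
    "        {\"Movies\": \"Toy Story\", \"Theater\": \"ME\", \"Avg_rating\": 4.9},\n",
    "    ],\n",
    ")\n",
    "\n",
    "# Pull one named field out as its own Series\n",
    "movies = ratings.struct.field(\"Movies\")\n",
    "print(movies)\n",
    "\n",
    "# unnest() expands every field into a separate DataFrame column\n",
    "ratings_df = ratings.struct.unnest()\n",
    "print(ratings_df)"
   ]
  },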
  {
   "cell_type": "markdown",
   "id": "478a20f4-cf2b-4958-b3c8-52ff605f1d0b",
   "metadata": {},
   "source": [
    "## Missing Data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78953e43",
   "metadata": {},
   "source": [
    "Missing data in Polars is represented by `null`. This is distinct from `NaN` (Not a Number). \n",
    "Polars provides extensive functionality to handle nulls, including filling them with specific values, strategies, or expressions."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40e4b2f9-82d9-40b7-a1de-1b078b68b3ff",
   "metadata": {},
   "source": [
    "### Filling nulls with a single value, a strategy, or an expression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "eb5e4a39",
   "metadata": {},
   "outputs": [],
   "source": [
    "missing_df = pl.DataFrame(\n",
    "    {\n",
    "        \"value\": [None, 2, 3, 4, None, None, 7, 8, 9, None],\n",
    "    },\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "949b2d0d-e19f-49fe-8926-07ce78c34a2a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th></tr><tr><td>i64</td></tr></thead><tbody><tr><td>null</td></tr><tr><td>2</td></tr><tr><td>3</td></tr><tr><td>4</td></tr><tr><td>null</td></tr><tr><td>null</td></tr><tr><td>7</td></tr><tr><td>8</td></tr><tr><td>9</td></tr><tr><td>null</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 1)\n",
       "┌───────┐\n",
       "│ value │\n",
       "│ ---   │\n",
       "│ i64   │\n",
       "╞═══════╡\n",
       "│ null  │\n",
       "│ 2     │\n",
       "│ 3     │\n",
       "│ 4     │\n",
       "│ null  │\n",
       "│ null  │\n",
       "│ 7     │\n",
       "│ 8     │\n",
       "│ 9     │\n",
       "│ null  │\n",
       "└───────┘"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "18b9b6bf",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (1, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th></tr><tr><td>u32</td></tr></thead><tbody><tr><td>4</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (1, 1)\n",
       "┌───────┐\n",
       "│ value │\n",
       "│ ---   │\n",
       "│ u32   │\n",
       "╞═══════╡\n",
       "│ 4     │\n",
       "└───────┘"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df.null_count()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "13bab1fb",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th><th>single_value_fill</th></tr><tr><td>i64</td><td>i64</td></tr></thead><tbody><tr><td>null</td><td>-1</td></tr><tr><td>2</td><td>2</td></tr><tr><td>3</td><td>3</td></tr><tr><td>4</td><td>4</td></tr><tr><td>null</td><td>-1</td></tr><tr><td>null</td><td>-1</td></tr><tr><td>7</td><td>7</td></tr><tr><td>8</td><td>8</td></tr><tr><td>9</td><td>9</td></tr><tr><td>null</td><td>-1</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 2)\n",
       "┌───────┬───────────────────┐\n",
       "│ value ┆ single_value_fill │\n",
       "│ ---   ┆ ---               │\n",
       "│ i64   ┆ i64               │\n",
       "╞═══════╪═══════════════════╡\n",
       "│ null  ┆ -1                │\n",
       "│ 2     ┆ 2                 │\n",
       "│ 3     ┆ 3                 │\n",
       "│ 4     ┆ 4                 │\n",
       "│ null  ┆ -1                │\n",
       "│ null  ┆ -1                │\n",
       "│ 7     ┆ 7                 │\n",
       "│ 8     ┆ 8                 │\n",
       "│ 9     ┆ 9                 │\n",
       "│ null  ┆ -1                │\n",
       "└───────┴───────────────────┘"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df.with_columns(single_value_fill=pl.col(\"value\").fill_null(-1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "c71e373f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 8)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th><th>forward</th><th>backward</th><th>min</th><th>max</th><th>mean</th><th>zero</th><th>one</th></tr><tr><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td><td>i64</td></tr></thead><tbody><tr><td>null</td><td>null</td><td>2</td><td>2</td><td>9</td><td>5</td><td>0</td><td>1</td></tr><tr><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td><td>2</td></tr><tr><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td></tr><tr><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td></tr><tr><td>null</td><td>4</td><td>7</td><td>2</td><td>9</td><td>5</td><td>0</td><td>1</td></tr><tr><td>null</td><td>4</td><td>7</td><td>2</td><td>9</td><td>5</td><td>0</td><td>1</td></tr><tr><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td><td>7</td></tr><tr><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td><td>8</td></tr><tr><td>9</td><td>9</td><td>9</td><td>9</td><td>9</td><td>9</td><td>9</td><td>9</td></tr><tr><td>null</td><td>9</td><td>null</td><td>2</td><td>9</td><td>5</td><td>0</td><td>1</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 8)\n",
       "┌───────┬─────────┬──────────┬─────┬─────┬──────┬──────┬─────┐\n",
       "│ value ┆ forward ┆ backward ┆ min ┆ max ┆ mean ┆ zero ┆ one │\n",
       "│ ---   ┆ ---     ┆ ---      ┆ --- ┆ --- ┆ ---  ┆ ---  ┆ --- │\n",
       "│ i64   ┆ i64     ┆ i64      ┆ i64 ┆ i64 ┆ i64  ┆ i64  ┆ i64 │\n",
       "╞═══════╪═════════╪══════════╪═════╪═════╪══════╪══════╪═════╡\n",
       "│ null  ┆ null    ┆ 2        ┆ 2   ┆ 9   ┆ 5    ┆ 0    ┆ 1   │\n",
       "│ 2     ┆ 2       ┆ 2        ┆ 2   ┆ 2   ┆ 2    ┆ 2    ┆ 2   │\n",
       "│ 3     ┆ 3       ┆ 3        ┆ 3   ┆ 3   ┆ 3    ┆ 3    ┆ 3   │\n",
       "│ 4     ┆ 4       ┆ 4        ┆ 4   ┆ 4   ┆ 4    ┆ 4    ┆ 4   │\n",
       "│ null  ┆ 4       ┆ 7        ┆ 2   ┆ 9   ┆ 5    ┆ 0    ┆ 1   │\n",
       "│ null  ┆ 4       ┆ 7        ┆ 2   ┆ 9   ┆ 5    ┆ 0    ┆ 1   │\n",
       "│ 7     ┆ 7       ┆ 7        ┆ 7   ┆ 7   ┆ 7    ┆ 7    ┆ 7   │\n",
       "│ 8     ┆ 8       ┆ 8        ┆ 8   ┆ 8   ┆ 8    ┆ 8    ┆ 8   │\n",
       "│ 9     ┆ 9       ┆ 9        ┆ 9   ┆ 9   ┆ 9    ┆ 9    ┆ 9   │\n",
       "│ null  ┆ 9       ┆ null     ┆ 2   ┆ 9   ┆ 5    ┆ 0    ┆ 1   │\n",
       "└───────┴─────────┴──────────┴─────┴─────┴──────┴──────┴─────┘"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df.with_columns(\n",
    "    forward=pl.col(\"value\").fill_null(strategy=\"forward\"),\n",
    "    backward=pl.col(\"value\").fill_null(strategy=\"backward\"),\n",
    "    min=pl.col(\"value\").fill_null(strategy=\"min\"),\n",
    "    max=pl.col(\"value\").fill_null(strategy=\"max\"),\n",
    "    mean=pl.col(\"value\").fill_null(strategy=\"mean\"),\n",
    "    zero=pl.col(\"value\").fill_null(strategy=\"zero\"),\n",
    "    one=pl.col(\"value\").fill_null(strategy=\"one\"),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "4740ddf0-66b9-4d72-ac5f-2ed2145a525f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 2)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th><th>expression_mean</th></tr><tr><td>i64</td><td>f64</td></tr></thead><tbody><tr><td>null</td><td>5.5</td></tr><tr><td>2</td><td>2.0</td></tr><tr><td>3</td><td>3.0</td></tr><tr><td>4</td><td>4.0</td></tr><tr><td>null</td><td>5.5</td></tr><tr><td>null</td><td>5.5</td></tr><tr><td>7</td><td>7.0</td></tr><tr><td>8</td><td>8.0</td></tr><tr><td>9</td><td>9.0</td></tr><tr><td>null</td><td>5.5</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 2)\n",
       "┌───────┬─────────────────┐\n",
       "│ value ┆ expression_mean │\n",
       "│ ---   ┆ ---             │\n",
       "│ i64   ┆ f64             │\n",
       "╞═══════╪═════════════════╡\n",
       "│ null  ┆ 5.5             │\n",
       "│ 2     ┆ 2.0             │\n",
       "│ 3     ┆ 3.0             │\n",
       "│ 4     ┆ 4.0             │\n",
       "│ null  ┆ 5.5             │\n",
       "│ null  ┆ 5.5             │\n",
       "│ 7     ┆ 7.0             │\n",
       "│ 8     ┆ 8.0             │\n",
       "│ 9     ┆ 9.0             │\n",
       "│ null  ┆ 5.5             │\n",
       "└───────┴─────────────────┘"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df.with_columns(\n",
    "    expression_mean=pl.col(\"value\").fill_null(pl.col(\"value\").mean())\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "49fcbf6a-ce14-4e7e-a3de-2e897da3ab5a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (10, 1)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>value</th></tr><tr><td>f64</td></tr></thead><tbody><tr><td>null</td></tr><tr><td>2.0</td></tr><tr><td>3.0</td></tr><tr><td>4.0</td></tr><tr><td>5.0</td></tr><tr><td>6.0</td></tr><tr><td>7.0</td></tr><tr><td>8.0</td></tr><tr><td>9.0</td></tr><tr><td>null</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (10, 1)\n",
       "┌───────┐\n",
       "│ value │\n",
       "│ ---   │\n",
       "│ f64   │\n",
       "╞═══════╡\n",
       "│ null  │\n",
       "│ 2.0   │\n",
       "│ 3.0   │\n",
       "│ 4.0   │\n",
       "│ 5.0   │\n",
       "│ 6.0   │\n",
       "│ 7.0   │\n",
       "│ 8.0   │\n",
       "│ 9.0   │\n",
       "│ null  │\n",
       "└───────┘"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "missing_df.interpolate()"
   ]
  },
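  {
   "cell_type": "markdown",
   "id": "d1a8e5b7",
   "metadata": {},
   "source": [
    "In the outputs above, `strategy=\"mean\"` filled the integer column with 5 and kept the `i64` type, while the expression-based fill produced 5.5 in an `f64` column. The cell below is a small sketch of one way to keep the fractional mean with the strategy API: cast to `Float64` before filling. The `values` frame repeats the data from `missing_df` so the cell is self-contained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e2b9f6c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "values = pl.DataFrame({\"value\": [None, 2, 3, 4, None, None, 7, 8, 9, None]})\n",
    "\n",
    "filled = values.with_columns(\n",
    "    # On the integer column the filled result stays an integer column\n",
    "    int_mean=pl.col(\"value\").fill_null(strategy=\"mean\"),\n",
    "    # Casting to Float64 first keeps the fractional mean\n",
    "    float_mean=pl.col(\"value\").cast(pl.Float64).fill_null(strategy=\"mean\"),\n",
    ")\n",
    "print(filled)"
   ]
  },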
  {
   "cell_type": "markdown",
   "id": "c9aa5d2b-3a8e-4192-b7e9-50512d95d429",
   "metadata": {},
   "source": [
    "# `null` vs. `NaN` (Not a Number)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7414b43",
   "metadata": {},
   "source": [
    "It is crucial to distinguish between `null` and `NaN`:\n",
    "- **`null`**: Represents missing data. It applies to all data types.\n",
    "- **`NaN` (Not a Number)**: A special floating-point value representing an undefined result (e.g., 0/0). It only occurs in floating-point columns.\n",
    "\n",
    "Polars handles them differently: aggregations such as `mean()` skip `null` values, while `NaN` is treated as a regular float value and propagates through arithmetic and through aggregations such as `sum()` and `mean()` (e.g., `NaN + 1` is `NaN`)."
   ]
  },
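  {
   "cell_type": "markdown",
   "id": "f3c0a7d9",
   "metadata": {},
   "source": [
    "The difference in aggregations can be checked directly. The cell below is a minimal sketch with two illustrative Series (`with_null` and `with_nan` are names introduced here): the mean skips the `null`, but a single `NaN` makes the whole result `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a4d1b8e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "with_null = pl.Series(\"value\", [1.0, None, 3.0])\n",
    "with_nan = pl.Series(\"value\", [1.0, float(\"nan\"), 3.0])\n",
    "\n",
    "# null is skipped: the mean is taken over 1.0 and 3.0 only\n",
    "null_mean = with_null.mean()\n",
    "print(null_mean)\n",
    "\n",
    "# NaN takes part in the arithmetic, so the mean itself is NaN\n",
    "nan_mean = with_nan.mean()\n",
    "print(nan_mean)"
   ]
  },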
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1811f375",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "nan_df = pl.DataFrame({\n",
    "    \"value\": [1.0, np.nan, None, 4.0]\n",
    "})\n",
    "print(nan_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c1f870aa",
   "metadata": {},
   "source": [
    "You can check for these values using `is_nan()` and `is_null()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "891e6717",
   "metadata": {},
   "outputs": [],
   "source": [
    "nan_df.with_columns(\n",
    "    is_nan=pl.col(\"value\").is_nan(),\n",
    "    is_null=pl.col(\"value\").is_null(),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af7eba83",
   "metadata": {},
   "source": [
    "To handle `NaN` values, you can use `fill_nan()`. Note that `fill_null()` does NOT affect `NaN` values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "df0a297f",
   "metadata": {},
   "outputs": [],
   "source": [
    "nan_df.with_columns(\n",
    "    filled_nan=pl.col(\"value\").fill_nan(0.0),\n",
    "    filled_null=pl.col(\"value\").fill_null(0.0),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5077f6a4-6944-48a6-93b7-3fc82bcd322b",
   "metadata": {},
   "source": [
    "# Data Type Conversion"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ffbe84b",
   "metadata": {},
   "source": [
    "Changing data types (casting) is a common operation. Polars uses the `.cast()` method.\n",
    "By default, casting is **strict**: if a value cannot be converted (e.g., casting \"abc\" to an integer), Polars raises an error. Passing `strict=False` converts such values to `null` instead."
   ]
  },
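  {
   "cell_type": "markdown",
   "id": "b5e2c9f1",
   "metadata": {},
   "source": [
    "The cell below is a short sketch of strict versus non-strict casting on an illustrative Series (`raw` is a name introduced here, with one value that is not a number). It catches the base class `pl.exceptions.PolarsError` rather than a specific subclass, since the exact exception type is an assumption that may vary between Polars versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6f3d0a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "raw = pl.Series(\"id\", [\"10000\", \"abc\"])\n",
    "\n",
    "# strict=False turns values that cannot be converted into null\n",
    "lenient = raw.cast(pl.Int64, strict=False)\n",
    "print(lenient)\n",
    "\n",
    "# The default strict cast raises instead of producing null\n",
    "strict_failed = False\n",
    "try:\n",
    "    raw.cast(pl.Int64)\n",
    "except pl.exceptions.PolarsError as err:\n",
    "    strict_failed = True\n",
    "    print(f\"Strict cast failed: {type(err).__name__}\")"
   ]
  },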
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "5e7f42b9-d4ba-44a8-9099-f6134e1d9d69",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "shape: (3, 1)\n",
      "┌───────┐\n",
      "│ id    │\n",
      "│ ---   │\n",
      "│ str   │\n",
      "╞═══════╡\n",
      "│ 10000 │\n",
      "│ 20000 │\n",
      "│ 30000 │\n",
      "└───────┘\n",
      "Estimated size: 15 bytes\n"
     ]
    }
   ],
   "source": [
    "string_df = pl.DataFrame({\"id\": [\"10000\", \"20000\", \"30000\"]})\n",
    "print(string_df)\n",
    "print(f\"Estimated size: {string_df.estimated_size('b')} bytes\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "id": "ac16083f-9af7-4350-b681-5f22bd3f2759",
   "metadata": {},
   "outputs": [],
   "source": [
    "string_df_int = string_df.select(pl.col(\"id\").cast(pl.UInt16))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "id": "da0a1402-26b3-41dd-b3f8-e9fd6cf800e0",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Estimated size: 6 bytes\n"
     ]
    }
   ],
   "source": [
    "print(f\"Estimated size: {string_df_int.estimated_size('b')} bytes\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "id": "4dcfab05-85e1-446e-a173-6b6c74b15c61",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_types_df = pl.DataFrame(\n",
    "    {\n",
    "        \"id\": [10000, 20000, 30000],\n",
    "        \"value\": [1.0, 2.0, 3.0],\n",
    "        \"value2\": [\"1\", \"2\", \"3\"],\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "2aec813a-31a9-4072-be5c-9ce9c6129200",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 3)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>value</th><th>value2</th></tr><tr><td>i64</td><td>f64</td><td>str</td></tr></thead><tbody><tr><td>10000</td><td>1.0</td><td>&quot;1&quot;</td></tr><tr><td>20000</td><td>2.0</td><td>&quot;2&quot;</td></tr><tr><td>30000</td><td>3.0</td><td>&quot;3&quot;</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 3)\n",
       "┌───────┬───────┬────────┐\n",
       "│ id    ┆ value ┆ value2 │\n",
       "│ ---   ┆ ---   ┆ ---    │\n",
       "│ i64   ┆ f64   ┆ str    │\n",
       "╞═══════╪═══════╪════════╡\n",
       "│ 10000 ┆ 1.0   ┆ 1      │\n",
       "│ 20000 ┆ 2.0   ┆ 2      │\n",
       "│ 30000 ┆ 3.0   ┆ 3      │\n",
       "└───────┴───────┴────────┘"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types_df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "id": "2028233a-9294-47bb-b712-29f44fc93765",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 3)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>value</th><th>value2</th></tr><tr><td>u16</td><td>u16</td><td>u16</td></tr></thead><tbody><tr><td>10000</td><td>1</td><td>1</td></tr><tr><td>20000</td><td>2</td><td>2</td></tr><tr><td>30000</td><td>3</td><td>3</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 3)\n",
       "┌───────┬───────┬────────┐\n",
       "│ id    ┆ value ┆ value2 │\n",
       "│ ---   ┆ ---   ┆ ---    │\n",
       "│ u16   ┆ u16   ┆ u16    │\n",
       "╞═══════╪═══════╪════════╡\n",
       "│ 10000 ┆ 1     ┆ 1      │\n",
       "│ 20000 ┆ 2     ┆ 2      │\n",
       "│ 30000 ┆ 3     ┆ 3      │\n",
       "└───────┴───────┴────────┘"
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types_df.cast(pl.UInt16)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "id": "2987c280",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 3)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>value</th><th>value2</th></tr><tr><td>u16</td><td>f32</td><td>u8</td></tr></thead><tbody><tr><td>10000</td><td>1.0</td><td>1</td></tr><tr><td>20000</td><td>2.0</td><td>2</td></tr><tr><td>30000</td><td>3.0</td><td>3</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 3)\n",
       "┌───────┬───────┬────────┐\n",
       "│ id    ┆ value ┆ value2 │\n",
       "│ ---   ┆ ---   ┆ ---    │\n",
       "│ u16   ┆ f32   ┆ u8     │\n",
       "╞═══════╪═══════╪════════╡\n",
       "│ 10000 ┆ 1.0   ┆ 1      │\n",
       "│ 20000 ┆ 2.0   ┆ 2      │\n",
       "│ 30000 ┆ 3.0   ┆ 3      │\n",
       "└───────┴───────┴────────┘"
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types_df.cast({\"id\": pl.UInt16, \"value\": pl.Float32, \"value2\": pl.UInt8})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "id": "cc26f3be",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 3)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>value</th><th>value2</th></tr><tr><td>i64</td><td>f32</td><td>u8</td></tr></thead><tbody><tr><td>10000</td><td>1.0</td><td>1</td></tr><tr><td>20000</td><td>2.0</td><td>2</td></tr><tr><td>30000</td><td>3.0</td><td>3</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 3)\n",
       "┌───────┬───────┬────────┐\n",
       "│ id    ┆ value ┆ value2 │\n",
       "│ ---   ┆ ---   ┆ ---    │\n",
       "│ i64   ┆ f32   ┆ u8     │\n",
       "╞═══════╪═══════╪════════╡\n",
       "│ 10000 ┆ 1.0   ┆ 1      │\n",
       "│ 20000 ┆ 2.0   ┆ 2      │\n",
       "│ 30000 ┆ 3.0   ┆ 3      │\n",
       "└───────┴───────┴────────┘"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types_df.cast({pl.Float64: pl.Float32, pl.String: pl.UInt8})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "id": "3093d75b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars.selectors as cs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "3b764020",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div><style>\n",
       ".dataframe > thead > tr,\n",
       ".dataframe > tbody > tr {\n",
       "  text-align: right;\n",
       "  white-space: pre-wrap;\n",
       "}\n",
       "</style>\n",
       "<small>shape: (3, 3)</small><table border=\"1\" class=\"dataframe\"><thead><tr><th>id</th><th>value</th><th>value2</th></tr><tr><td>u16</td><td>u16</td><td>str</td></tr></thead><tbody><tr><td>10000</td><td>1</td><td>&quot;1&quot;</td></tr><tr><td>20000</td><td>2</td><td>&quot;2&quot;</td></tr><tr><td>30000</td><td>3</td><td>&quot;3&quot;</td></tr></tbody></table></div>"
      ],
      "text/plain": [
       "shape: (3, 3)\n",
       "┌───────┬───────┬────────┐\n",
       "│ id    ┆ value ┆ value2 │\n",
       "│ ---   ┆ ---   ┆ ---    │\n",
       "│ u16   ┆ u16   ┆ str    │\n",
       "╞═══════╪═══════╪════════╡\n",
       "│ 10000 ┆ 1     ┆ 1      │\n",
       "│ 20000 ┆ 2     ┆ 2      │\n",
       "│ 30000 ┆ 3     ┆ 3      │\n",
       "└───────┴───────┴────────┘"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data_types_df.cast({cs.numeric(): pl.UInt16})"
   ]
  },
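  {
   "cell_type": "markdown",
   "id": "selector-cast-sketch-md",
   "metadata": {},
   "source": [
    "Selectors can also be narrower than `cs.numeric()`. As a sketch, the cell below rebuilds the same three columns under the hypothetical name `selector_demo_df` and gives integer, float, and string columns a different target type each, using `cs.integer()`, `cs.float()`, and `cs.string()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "selector-cast-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "import polars.selectors as cs\n",
    "\n",
    "# Hypothetical copy of the example data, so this cell runs on its own\n",
    "selector_demo_df = pl.DataFrame(\n",
    "    {\n",
    "        \"id\": [10000, 20000, 30000],\n",
    "        \"value\": [1.0, 2.0, 3.0],\n",
    "        \"value2\": [\"1\", \"2\", \"3\"],\n",
    "    }\n",
    ")\n",
    "\n",
    "# Each selector picks columns by their current dtype\n",
    "selector_demo_df.cast(\n",
    "    {\n",
    "        cs.integer(): pl.UInt16,\n",
    "        cs.float(): pl.Float32,\n",
    "        cs.string(): pl.UInt8,\n",
    "    }\n",
    ")"
   ]
  },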
  {
   "cell_type": "markdown",
   "id": "34f0e09a",
   "metadata": {},
   "source": [
    "### Strict vs Non-Strict Casting\n",
    "By default, `cast` runs in strict mode and raises an error when any value cannot be converted to the target type. Pass `strict=False` to turn those failing values into `null` instead."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "473c888a",
   "metadata": {},
   "outputs": [],
   "source": [
    "strict_df = pl.DataFrame({\"val\": [\"1\", \"2\", \"a\"]})\n",
    "\n",
    "# Strict casting (the default) raises an error because \"a\" is not a valid integer:\n",
    "# strict_df.select(pl.col(\"val\").cast(pl.Int64))\n",
    "\n",
    "# Non-strict casting replaces the values that fail to convert with null:\n",
    "strict_df.select(pl.col(\"val\").cast(pl.Int64, strict=False))"
   ]
  }
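  ,
  {
   "cell_type": "markdown",
   "id": "overflow-cast-sketch-md",
   "metadata": {},
   "source": [
    "The same switch applies to numeric casts that overflow the target type. The sketch below uses a made-up frame, `overflow_df`, with one value (`70000`) above the `UInt16` maximum of 65535. It catches the broad `pl.exceptions.PolarsError` base class because the exact exception subclass is an assumption that may vary between versions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "overflow-cast-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import polars as pl\n",
    "\n",
    "# Hypothetical frame: 70000 does not fit in UInt16\n",
    "overflow_df = pl.DataFrame({\"id\": [10000, 20000, 70000]})\n",
    "\n",
    "# Strict casting (the default) refuses to convert the out-of-range value\n",
    "try:\n",
    "    overflow_df.cast(pl.UInt16)\n",
    "except pl.exceptions.PolarsError as exc:\n",
    "    print(f\"Strict cast failed: {exc}\")\n",
    "\n",
    "# With strict=False the out-of-range value becomes null\n",
    "overflow_df.cast(pl.UInt16, strict=False)"
   ]
  }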
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
