Importing Dictionaries from Python Files in Nested Directories and Building a Pandas DataFrame

2025-10-22 22:32


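This tutorial explains how to extract dictionary data from multiple Python files spread across a nested directory structure and combine it into a single Pandas DataFrame. It walks through traversing the file system with `os.walk`, parsing the dictionary text safely with `ast.literal_eval`, and building and merging DataFrames with pandas, which gives you a practical way to handle scattered configuration or data files.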

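In many projects, configuration, metadata, or small pieces of data are stored as Python dictionaries in several .py files, and those files may sit deep inside a complex directory structure. When you need to collect these scattered dictionaries in one place and analyze them, a Pandas DataFrame is a natural data structure to use. This article shows how to get there step by step.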

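1. Understanding the problem and the goal

Suppose your project is laid out as follows:

    base_directory/
    ├── module_a/
    │   └── sub_module_x/
    │       └── form.py  # contains a dictionary
    └── module_b/
        ├── sub_module_y/
        │   └── form.py  # contains a dictionary
        └── sub_module_z/
            └── form.py  # contains a dictionary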

Each form.py file contains one dictionary, for example:

    # form.py
    def_options = {"name": "Alice", "age": 30, "city": "New York"}

The goal is to visit every form.py file, extract its def_options dictionary, and merge all of those dictionaries into one Pandas DataFrame.
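To make the goal concrete, here is a minimal sketch of the table we are aiming for, built by hand from two dictionaries shaped like the `def_options` example. The second dictionary and its values are made up purely for illustration.

```python
import pandas as pd

# Two dictionaries shaped like the def_options example. The second one is
# hypothetical and only here to show how rows line up.
options_a = {"name": "Alice", "age": 30, "city": "New York"}
options_b = {"name": "Bob", "age": 25, "city": "London"}

# A list of dicts becomes one row per dict, with the keys as column names.
target_df = pd.DataFrame([options_a, options_b])
print(target_df)
```

The rest of the tutorial is about producing that list of dictionaries automatically from files on disk.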


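2. Traversing the file system and locating files

First we need a way to visit every subdirectory and file under a given root so that we can find the form.py files we care about. Python's os module provides os.walk(), which traverses a directory tree recursively.

    import os
    import pandas as pd
    import ast  # used to safely evaluate a string as a Python object

    # The root directory to search.
    # Replace with your real path, e.g. os.environ["JUPYTER_ROOT"] + "/charts/"
    base_directory = "/path/to/your/base_directory"

    # Start with an empty DataFrame that will hold all dictionary data
    all_data_df = pd.DataFrame()

    # Walk the directory tree
    for root, dirs, files in os.walk(base_directory):
        for file in files:
            if file.endswith("form.py"):
                file_path = os.path.join(root, file)
                print(f"Found file: {file_path}")
                # The following steps process the file contents here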

In the code above:

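base_directory should be replaced with your actual project root.
os.walk(base_directory) yields triples (root, dirs, files), where root is the path of the directory currently being visited, dirs is the list of its subdirectories, and files is the list of files it contains.
file.endswith("form.py") selects the target files. Note that this is a suffix test, so it also matches names such as platform.py; compare with file == "form.py" if you need an exact match.
os.path.join(root, file) builds the full path to the file.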
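The traversal described above can be tried without touching a real project. The sketch below builds a throwaway tree in a temporary directory and collects the matching paths; `find_form_files` is a helper name invented for this example, and it uses an exact file-name comparison rather than `endswith`.

```python
import os
import tempfile

def find_form_files(base_directory, target_name="form.py"):
    """Return the sorted paths of files named exactly target_name under base_directory."""
    matches = []
    for root, dirs, files in os.walk(base_directory):
        for file in files:
            # Exact comparison: endswith("form.py") would also match "platform.py".
            if file == target_name:
                matches.append(os.path.join(root, file))
    return sorted(matches)

# Build a small throwaway tree that mirrors the layout from section 1.
with tempfile.TemporaryDirectory() as tmp:
    for parts in [("module_a", "sub_module_x"), ("module_b", "sub_module_y")]:
        folder = os.path.join(tmp, *parts)
        os.makedirs(folder)
        with open(os.path.join(folder, "form.py"), "w", encoding="utf-8") as f:
            f.write('def_options = {"name": "Alice", "age": 30}\n')
    # A decoy whose name merely ends with "form.py".
    with open(os.path.join(tmp, "module_a", "platform.py"), "w", encoding="utf-8") as f:
        f.write("# not a target\n")

    found = find_form_files(tmp)
    relative = [os.path.relpath(p, tmp) for p in found]
    print(relative)
```

Only the two form.py files are reported; the decoy is skipped because its name is compared in full.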

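3. Safely extracting the dictionary from a file

Once a form.py file has been found, the next step is to read its contents and pull out the dictionary. Because these files are Python scripts, we should not simply import them (unless they were designed as importable modules with no side effects); instead we treat them as text files.

    # ... (continues from the code above)

    for root, dirs, files in os.walk(base_directory):
        for file in files:
            if file.endswith("form.py"):
                file_path = os.path.join(root, file)
                print(f"Processing file: {file_path}")

                with open(file_path, "r", encoding="utf-8") as f:
                    for line in f:
                        stripped_line = line.strip()
                        # Assume the dictionary is defined on a single line of the form "def_options = {...}"
                        # and that it contains keys such as "name" and "age", which we use to recognize it
                        if "def_options =" in stripped_line and "name" in stripped_line and "age" in stripped_line:
                            try:
                                # Split the string and keep the dictionary text to the right of the equals sign
                                dict_str_only = stripped_line.split("=", 1)[1].strip()
                                # Use ast.literal_eval to safely evaluate the string as a Python dictionary
                                extracted_dictionary = ast.literal_eval(dict_str_only)

                                # Convert the extracted dictionary to a DataFrame and append it
                                # Note: DataFrame([dict]) uses the dict keys as column names and the values as one row
                                temp_df = pd.DataFrame([extracted_dictionary])
                                all_data_df = pd.concat([all_data_df, temp_df], ignore_index=True)
                                print(f"Extracted the dictionary from {file_path} and added it to the DataFrame.")
                                break  # Assume one target dictionary per file; stop reading this file once found
                            except (SyntaxError, ValueError) as e:
                                print(f"Error: could not evaluate the dictionary string from {file_path}: {dict_str_only} - {e}")
                            except IndexError:
                                print(f"Warning: the line '{stripped_line}' in {file_path} does not have the expected format.")

    # ... (print all_data_df or continue processing afterwards)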

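Key points:

Reading the file: with open(file_path, "r", encoding="utf-8") as f: opens the file read-only with an explicit encoding. The with statement guarantees the file is closed when the block ends.
Line processing: for line in f: reads the file line by line. stripped_line = line.strip() removes leading and trailing whitespace.
Recognizing the dictionary line: if "def_options =" in stripped_line and "name" in stripped_line and "age" in stripped_line: is a heuristic. It assumes the line contains "def_options =" and also mentions specific keys such as name and age. In practice you will probably need to adapt this condition to how your dictionaries are defined; for example, if the dictionary is always assigned to one particular variable, checking for that variable name is enough.
Splitting the string: dict_str_only = stripped_line.split("=", 1)[1].strip() splits the line at the first equals sign, keeps the second part (everything to the right of the =), and strips surrounding whitespace.
Safe evaluation: ast.literal_eval(dict_str_only) is the key step that turns the string into a Python literal (a dictionary, list, number, string, and so on). It is safer than eval() because it only evaluates literals and never executes arbitrary code.
Error handling: the try-except block catches SyntaxError or ValueError (raised when the dictionary text is malformed or is not a plain literal) and IndexError (a defensive guard in case the split yields no right-hand side), which makes the code more robust.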
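The points above can be condensed into one small function and exercised on a few sample lines. This is a sketch under the same single-line assumption as the tutorial's code; `parse_assignment_line` is a name invented here, and it uses `str.partition` instead of `split` so that a missing equals sign needs no `IndexError` handling.

```python
import ast

def parse_assignment_line(line, variable_name="def_options"):
    """Return the dict literal assigned on this line, or None if the line does not qualify."""
    stripped = line.strip()
    name, sep, right_side = stripped.partition("=")
    # Require an equals sign and an exact match on the variable name.
    if not sep or name.strip() != variable_name:
        return None
    try:
        value = ast.literal_eval(right_side.strip())
    except (SyntaxError, ValueError):
        # Raised for malformed text and for anything that is not a plain literal.
        return None
    return value if isinstance(value, dict) else None

good = parse_assignment_line('def_options = {"name": "Alice", "age": 30}')
truncated = parse_assignment_line('def_options = {"name": "Charlie", "age": 35')
not_literal = parse_assignment_line('def_options = {"name": __import__("os").getcwd()}')
print(good, truncated, not_literal)
```

The third sample is the interesting one: `eval()` would run the embedded call, whereas `ast.literal_eval` refuses it and the function returns None.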

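4. Building and merging the Pandas DataFrame

Each time a dictionary is extracted successfully, we convert it to a temporary Pandas DataFrame and append it to the main DataFrame.

    # ... (continues from the code above)

    # Convert the extracted dictionary to a DataFrame and append it
    temp_df = pd.DataFrame([extracted_dictionary])
    all_data_df = pd.concat([all_data_df, temp_df], ignore_index=True)

pd.DataFrame([extracted_dictionary]): converts a single dictionary into a DataFrame. The dictionary has to be wrapped in a list so that pandas treats it as one row, with the dictionary keys as column names.
pd.concat([all_data_df, temp_df], ignore_index=True): appends the temporary temp_df to all_data_df. ignore_index=True regenerates a continuous row index so that index values are not repeated.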
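A short experiment shows what these two calls do when the dictionaries do not share exactly the same keys. The sample dictionaries are illustrative, and for brevity the sketch concatenates a list of one-row frames in a single call rather than growing a DataFrame inside a loop.

```python
import pandas as pd

first = {"name": "Alice", "age": 30, "city": "New York"}
second = {"name": "Bob", "age": 25, "city": "London", "occupation": "Engineer"}

# One single-row DataFrame per dictionary.
frames = [pd.DataFrame([d]) for d in (first, second)]

# Without ignore_index every row keeps its original label 0.
duplicated_index = pd.concat(frames).index.tolist()

# With ignore_index=True the rows are renumbered 0, 1, ...
combined = pd.concat(frames, ignore_index=True)

print(duplicated_index)
print(combined)
```

Columns are the union of all keys, and a row that lacks a key (here Alice's occupation) gets a missing value in that column.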

5. Complete code example

Putting all of the steps together gives a complete solution:

    import os
    import pandas as pd
    import ast

    def extract_dicts_to_dataframe(base_directory: str, filename_pattern: str = "form.py", dict_variable_name: str = "def_options") -> pd.DataFrame:
        """
        Extract dictionaries from Python files under a directory and merge them into one Pandas DataFrame.

        Args:
            base_directory (str): Root directory to search.
            filename_pattern (str): File name suffix to look for, e.g. "form.py".
            dict_variable_name (str): Variable the dictionary is assigned to in each file, e.g. "def_options".

        Returns:
            pd.DataFrame: A DataFrame containing the data from every extracted dictionary.
        """
        all_data_df = pd.DataFrame()

        print(f"Searching directory '{base_directory}' for '{filename_pattern}' files...")

        for root, dirs, files in os.walk(base_directory):
            for file in files:
                if file.endswith(filename_pattern):
                    file_path = os.path.join(root, file)
                    print(f"Processing file: {file_path}")

                    with open(file_path, "r", encoding="utf-8") as f:
                        for line_num, line in enumerate(f, 1):
                            stripped_line = line.strip()

                            # More robust recognition: check the variable name, the equals sign, and both braces
                            if stripped_line.startswith(f"{dict_variable_name} =") and "{" in stripped_line and "}" in stripped_line:
                                try:
                                    # Split the string and keep the dictionary text to the right of the equals sign
                                    dict_str_only = stripped_line.split("=", 1)[1].strip()
                                    # Use ast.literal_eval to safely evaluate the string as a Python dictionary
                                    extracted_dictionary = ast.literal_eval(dict_str_only)

                                    # Convert the extracted dictionary to a DataFrame and append it
                                    temp_df = pd.DataFrame([extracted_dictionary])
                                    all_data_df = pd.concat([all_data_df, temp_df], ignore_index=True)
                                    print(f"  Extracted the dictionary from {file_path} (line {line_num}) and added it to the DataFrame.")
                                    break  # Assume one target dictionary per file; stop reading this file once found
                                except (SyntaxError, ValueError) as e:
                                    print(f"  Error: could not evaluate the dictionary string from {file_path} (line {line_num}): '{dict_str_only}' - {e}")
                                except IndexError:
                                    print(f"  Warning: the line '{stripped_line}' in {file_path} (line {line_num}) does not have the expected format.")

        if all_data_df.empty:
            print("No dictionaries were found, or none could be extracted.")
        else:
            print("\nAll dictionaries were merged into the DataFrame.")
        return all_data_df

    # --- Usage example ---
    # Replace this path with your real root directory
    # e.g. base_path = os.environ.get("JUPYTER_ROOT", ".") + "/charts/"
    base_path = "/home/jovyan/work/notebooks/charts/"  # example path

    # Create a few files for testing (optional)
    # import pathlib
    # test_dir = pathlib.Path(base_path)
    # test_dir.mkdir(parents=True, exist_ok=True)
    # (test_dir / "ahc_visits" / "booking_breakdown_per_age_group").mkdir(parents=True, exist_ok=True)
    # (test_dir / "ahc_visits" / "booking_breakdown_per_age_group" / "form.py").write_text('def_options = {"name": "Alice", "age": 30, "city": "New York"}\n')
    # (test_dir / "another_module" / "sub_folder").mkdir(parents=True, exist_ok=True)
    # (test_dir / "another_module" / "sub_folder" / "form.py").write_text('def_options = {"name": "Bob", "age": 25, "city": "London", "occupation": "Engineer"}\n')
    # (test_dir / "empty_folder").mkdir(parents=True, exist_ok=True)
    # (test_dir / "bad_format").mkdir(parents=True, exist_ok=True)
    # (test_dir / "bad_format" / "form.py").write_text('def_options = {"name": "Charlie", "age": 35, "city": "Paris", "occupation": "Doctor"\n')  # missing }

    result_df = extract_dicts_to_dataframe(base_path, dict_variable_name="def_options")

    print("\nFinal Pandas DataFrame:")
    print(result_df)

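6. Caveats and best practices

Robustness of dictionary recognition: the recognition step in the example (stripped_line.startswith(f"{dict_variable_name} =")) depends on the variable name and on the assignment being written in one particular way. If your form.py files define the dictionary inconsistently (for example, across several lines, or assigned to a different variable name), you will need to adjust the logic. For more complex Python files, consider using Python's ast module for a deeper analysis of the abstract syntax tree; a full treatment is beyond the scope of this tutorial.
Safety of ast.literal_eval(): always prefer ast.literal_eval() to eval(). eval() can execute arbitrary Python code, which is a serious security risk, whereas ast.literal_eval() is limited to Python literals (strings, numbers, tuples, lists, dictionaries, booleans, and None) and is therefore much safer.
Encoding: passing encoding="utf-8" when opening files is a good habit. It avoids depending on the platform's default encoding, which is a common cause of UnicodeDecodeError when files are saved as UTF-8.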

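Performance: if the number of files is very large, calling pd.concat repeatedly can hurt performance, because each call copies all of the rows accumulated so far. In that case it is better to collect every extracted dictionary in a list and create the final DataFrame once with pd.DataFrame(list_of_dicts).

    # Optimized DataFrame construction
    all_extracted_dicts = []
    # ... (inside the loop, after a dictionary has been extracted successfully)
    # all_extracted_dicts.append(extracted_dictionary)
    # ...
    # After the loop finishes
    # if all_extracted_dicts:
    #     all_data_df = pd.DataFrame(all_extracted_dicts)
    # else:
    #     all_data_df = pd.DataFrame()  # or some other empty DataFrame initialization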

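Empty files and files without a dictionary: make sure your code handles files that do not contain the target dictionary, or that are completely empty, gracefully.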
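As a rough sketch of the AST-based route mentioned in the notes above, the function below parses the whole file text instead of scanning line by line, so a dictionary spread over several lines is handled, and empty files or files without the variable simply yield None. This is not the tutorial's own implementation: `extract_dict_from_source` is a name invented here, and it only looks at plain module-level assignments (annotated assignments such as `def_options: dict = {...}` are not covered).

```python
import ast

def extract_dict_from_source(source, variable_name="def_options"):
    """Return the dict literal assigned to variable_name at module level, or None."""
    try:
        # Parsing builds a syntax tree; nothing in the source is executed.
        tree = ast.parse(source)
    except SyntaxError:
        return None
    for node in tree.body:
        if not isinstance(node, ast.Assign):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id == variable_name:
                try:
                    # literal_eval also accepts an AST node, not only a string.
                    value = ast.literal_eval(node.value)
                except (ValueError, TypeError):
                    return None
                return value if isinstance(value, dict) else None
    return None

multi_line_source = '''
import os

def_options = {
    "name": "Alice",
    "age": 30,
}
'''
multi_line_result = extract_dict_from_source(multi_line_source)
empty_result = extract_dict_from_source("")
missing_result = extract_dict_from_source("other = 1")
print(multi_line_result, empty_result, missing_result)
```

To use it with the directory walk, read each file with `f.read()` and pass the text in; a None result means the file can be skipped.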

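Summary

This tutorial presented a complete solution for extracting dictionary data from Python files in a nested directory structure and building a Pandas DataFrame from it. By combining os.walk for traversal, simple text processing such as string splitting, and safe evaluation with ast.literal_eval, you can bring scattered structured data together in one DataFrame as a basis for further analysis and processing. In real projects, adapt the dictionary recognition and error handling to your actual file formats and security requirements.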
