Python Processing Unstructured Data
  • Date: 2024-11-03


Data that is already in a row-and-column format, or that can easily be converted to rows and columns so that it fits neatly into a database, is known as structured data. Examples are CSV, TXT and XLS files. These files have a delimiter and either fixed or variable width, and missing values are represented as blanks between the delimiters. But sometimes we get data where the lines are not fixed width, or the data is just HTML, image or PDF files. Such data is known as unstructured data. While an HTML file can be handled by processing its HTML tags, a feed from Twitter or a plain text document from a news feed has neither a delimiter nor tags to work with. In such scenarios we use built-in functions from various Python libraries to process the file.
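To make the remark about processing HTML tags concrete, here is a minimal sketch that uses the standard library's html.parser module to pull the text out of a small HTML string. The class name TextExtractor and the sample markup are made up for illustration; real pages usually need more careful handling (scripts, styles, entities).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper that collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip whitespace-only runs
        text = data.strip()
        if text:
            self.chunks.append(text)


# Illustrative markup, not taken from a real page
sample_html = (
    "<html><body><h1>Python</h1>"
    "<p>An interpreted <b>high-level</b> language.</p></body></html>"
)

parser = TextExtractor()
parser.feed(sample_html)
parser.close()

extracted = parser.chunks
print(extracted)
```

The tags themselves are discarded and only the text between them is kept, which turns the HTML into plain text that can be processed line by line or word by word.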

Reading Data

In the example below we take a text file and read it, separating out each of its lines. The output can then be divided further into lines and words. The original file is a text file containing some paragraphs describing the Python language.

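filename = r'path\input.txt'  # placeholder path; point this at your own text file

with open(filename) as fn:

    # Read the first line
    ln = fn.readline()

    # Keep count of lines
    lncnt = 1

    # readline() returns an empty string at end of file, which ends the loop
    while ln:
        print("Line {}: {}".format(lncnt, ln.strip()))
        ln = fn.readline()
        lncnt += 1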

When we execute the above code, it produces the following result.

Line 1: Python is an interpreted high-level programming language for general-purpose programming. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales.
Line 2: Python features a dynamic type system and automatic memory management. It supports multiple programming paradigms, including object-oriented, imperative, functional and procedural, and has a large and comprehensive standard library.
Line 3: Python interpreters are available for many operating systems. CPython, the reference implementation of Python, is open source software and has a community-based development model, as do nearly all of its variant implementations. CPython is managed by the non-profit Python Software Foundation.
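The example stops at whole lines, so here is a small sketch of the next step mentioned earlier, breaking each line into words. It assumes whitespace-separated words and uses an in-memory io.StringIO object with made-up sample text in place of the input file, so it runs without any file on disk.

```python
import io

# Stand-in for the input file (illustrative text, not the original file)
sample = io.StringIO(
    "Python is an interpreted high-level programming language.\n"
    "Python features a dynamic type system.\n"
)

words_per_line = []

# Iterating over a file-like object yields one line at a time
for lncnt, ln in enumerate(sample, start=1):
    words = ln.split()  # split on any run of whitespace
    if not words:
        continue  # skip blank lines
    words_per_line.append(words)
    print("Line {}: {} words".format(lncnt, len(words)))
```

Iterating over the file object directly is an alternative to calling readline() in a while loop; both read the file one line at a time.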

Counting Word Frequency

We can count the frequency of the words in the file using the Counter class from the collections module, as follows.

from collections import Counter

with open(r'path\input2.txt') as f:
    # split() breaks the text on whitespace; Counter tallies each distinct word
    p = Counter(f.read().split())
    print(p)

When we execute the above code, it produces the following result.

Counter({'and': 3, 'Python': 3, 'that': 2, 'a': 2, 'programming': 2, 'code': 1, '1991,': 1, 'is': 1, 'programming.': 1, 'dynamic': 1, 'an': 1, 'design': 1, 'in': 1, 'high-level': 1, 'management.': 1, 'features': 1, 'readability,': 1, 'van': 1, 'both': 1, 'for': 1, 'Rossum': 1, 'system': 1, 'provides': 1, 'memory': 1, 'has': 1, 'type': 1, 'enable': 1, 'Created': 1, 'philosophy': 1, 'constructs': 1, 'emphasizes': 1, 'general-purpose': 1, 'notably': 1, 'released': 1, 'significant': 1, 'Guido': 1, 'using': 1, 'interpreted': 1, 'by': 1, 'on': 1, 'language': 1, 'whitespace.': 1, 'clear': 1, 'It': 1, 'large': 1, 'small': 1, 'automatic': 1, 'scales.': 1, 'first': 1})
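Because the text is split on whitespace only, punctuation stays attached to words, so "programming" and "programming." are counted separately. One possible refinement, sketched here on a made-up sample string rather than the original file, is to lowercase each word and strip surrounding punctuation before counting.

```python
import string
from collections import Counter

# Illustrative text, standing in for the contents of the input file
text = (
    "Python is an interpreted high-level programming language "
    "for general-purpose programming. Python features a dynamic type system."
)

# Lowercase each token and strip punctuation from both ends;
# hyphens inside a word such as "high-level" are left alone
tokens = [w.strip(string.punctuation).lower() for w in text.split()]

# Ignore tokens that were pure punctuation and are now empty
counts = Counter(t for t in tokens if t)

# The two most frequent words with their counts
top = counts.most_common(2)
print(top)
```

Whether to fold case and drop punctuation depends on the task; for some analyses, "Python" and "python" should remain distinct.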