Home > Databases, MS-Access, Programming > htmlcleaner using named pipes with microsoft access

htmlcleaner using named pipes with microsoft access

November 26th, 2014 Leave a comment Go to comments

I use htmlcleaner to scrape html and then clean bits and pieces before I do some intense custom parsing. I used to do it the easy way with htmlcleaner reading and writing temporary files but it was annoyingly slow. Eventually I broke down and figured out how to run this java code with redirected standard input, output, and error streams.

Microsoft provides some help but as usual they never go all the way and give you something really useful.

support.microsoft.com/kb/177696
support.microsoft.com/kb/178116

First I wrote a wrapper to make my html cleaning dead simple.

Here’s a simple wrapper to call htmlcleaner with my preferred options:

Private Const szPipeName = "\\.\pipe\htmlclean"
Private Const PIPES_BUFFSIZE = 65536

Public Const JAVA_EXE As String = "C:\Utils\Java\jre8\bin\java.exe"
Public Const CLEANER_EXE As String = "C:\Utils\htmlcleaner-2.9.r264s.jar"

Public Function HtmlCleaner(ByVal stdin As String, ByRef stdout As String, ByRef stderr As String) As Boolean

    Dim cmdline As String
    Dim cmd_options As String
    Dim WindowStyle As Long
    

    cmd_options = "outputtype=htmlcompact advancedxmlescape=false specialentities=false unicodechars=false omitcomments=true omitxmldecl=true omitdoctypedecl=true omithtmlenvelope=true useemptyelementtags=true"
     
    cmdline = """" & JAVA_EXE & """ -jar """ & CLEANER_EXE & """ src=""" & szPipeName & """ incharset=iso-8859-1 outcharset=UTF-8 " & cmd_options
        
    HtmlCleaner = ExecCmdPipe(cmdline, stdin, stdout, stderr)
    
totalfail1:
   
    'Beep

success:
    'ok
    
End Function

Now here’s the real code. I’m not claiming any authorship. I stole this from the net anywhere that had useful bits and pieces then hacked at it until it worked. It was such a huge pain that I thought I should share and maybe save someone else a lot of trouble.

I had a terrible time with blocking pipes but refused to give up until it worked. This is code that I am actually using and it works. It’s not some half-assed sample code. I call this function about 4500 times or more in a single run. Leaking handles and memory is an issue so I try to clean up well and handle errors in a half decent manner.

More importantly it’s fast, a lot faster than reading and writing temporary files, and I get the error output which I later database so I can check for problems.

It’s obviously just the function not the whole vb module so you can’t just copy and paste this alone. This code calls other things like CreateNamedPipe and CreateProcessA. These calls are all standard library stuff and you can find the supporting vb module code pretty much anywhere, if not already in the microsoft links mentioned above. I’ve included my win32 library stuff at the bottom of this post. I expect this is everything you’d need to run this code.

Public Function ExecCmdPipe(ByVal cmdline As String, ByVal stdin As String, ByRef stdout As String, ByRef stderr As String, Optional WindowStyle As Long = vbHide) As Boolean
        
    Dim tSA_CreatePipe              As SECURITY_ATTRIBUTES
    Dim tSA_CreateProcessPrc        As SECURITY_ATTRIBUTES
    Dim tSA_CreateProcessThrd       As SECURITY_ATTRIBUTES
    Dim tSA_CreateProcessPrcInfo    As PROCESS_INFORMATION
    Dim tStartupInfo                As STARTUPINFO
    Dim bRead                       As Long
    Dim abytBuff()                  As Byte
    Dim lngResult                   As Long
    Dim szFullCommand               As String
    Dim lngExitCode                 As Long
    Dim lngSizeOf                   As Long
     
    Dim ohRead                       As Long
    Dim ohWrite                      As Long
    
    Dim ehRead                       As Long
    Dim ehWrite                      As Long
        
    Dim pSD As Long
    Dim sa As SECURITY_ATTRIBUTES
    Dim hPipe As Long
    
    Dim dwWritten As Long
    Dim toWrite As Long
    Dim loops As Long
    
    Dim i As Long, l As Long, dwOpenMode As Long, dwPipeMode As Long
    Dim res As Long, nCount As Long, cbnCount As Long

    dwWritten = 0
    stdout = ""
    stderr = ""
    ExecCmdPipe = False
    res = 0
    
    tSA_CreatePipe.nLength = Len(tSA_CreatePipe)
    tSA_CreatePipe.lpSecurityDescriptor = 0&
    tSA_CreatePipe.bInheritHandle = True
     
    tSA_CreateProcessPrc.nLength = Len(tSA_CreateProcessPrc)
    tSA_CreateProcessThrd.nLength = Len(tSA_CreateProcessThrd)
     
    If (CreatePipe(ohRead, ohWrite, tSA_CreatePipe, 0&) = 0&) Then GoTo totalfail1
    
    ' Ensure the read handle to the pipe for STDOUT is not inherited.

    If (SetHandleInformation(ohRead, HANDLE_FLAG_INHERIT, 0) = 0) Then GoTo totalfail2

    
    If (CreatePipe(ehRead, ehWrite, tSA_CreatePipe, 0&) = 0&) Then GoTo totalfail2
    
    ' Ensure the read handle to the pipe for STDERR is not inherited.
    If (SetHandleInformation(ehRead, HANDLE_FLAG_INHERIT, 0) = 0) Then GoTo totalfail3
    
    tStartupInfo.cb = Len(tStartupInfo)
    GetStartupInfo tStartupInfo
    
    With tStartupInfo
        .cb = Len(tStartupInfo) ' Initialize the STARTUPINFO structure:
        .hStdOutput = ohWrite
        .hStdError = ehWrite

        .dwFlags = STARTF_USESTDHANDLES
        
        If Not IsMissing(WindowStyle) Then
            .wShowWindow = WindowStyle
        Else
            .wShowWindow = SW_HIDE
        End If
        
        .dwFlags = .dwFlags Or STARTF_USESHOWWINDOW

    End With
    
    
    'Create the NULL security token for the pipe
    pSD = GlobalAlloc(GPTR, SECURITY_DESCRIPTOR_MIN_LENGTH)
    res = InitializeSecurityDescriptor(pSD, SECURITY_DESCRIPTOR_REVISION)
    res = SetSecurityDescriptorDacl(pSD, -1, 0, 0)
    sa.nLength = LenB(sa)
    sa.lpSecurityDescriptor = pSD
    sa.bInheritHandle = False
    
    'Create the Named Pipe
    dwOpenMode = PIPE_ACCESS_OUTBOUND Or FILE_FLAG_WRITE_THROUGH Or FILE_FLAG_FIRST_PIPE_INSTANCE
    dwPipeMode = PIPE_WAIT ' Or PIPE_TYPE_MESSAGE Or PIPE_READMODE_MESSAGE
    hPipe = CreateNamedPipe(szPipeName, dwOpenMode, dwPipeMode, 2, PIPES_BUFFSIZE, PIPES_BUFFSIZE, 0, sa)
    If (hPipe = 0) Then GoTo totalfail4
    

    DoEvents
    
    
    ' Start the shelled application:
    res = CreateProcessA(0&, cmdline, 0&, 0&, 1&, NORMAL_PRIORITY_CLASS, 0&, 0&, tStartupInfo, tSA_CreateProcessPrcInfo)
    
    If (res = 0) Then GoTo totalfail5
    
    ' Wait for the app to connect to the named pipe
    
    res = ConnectNamedPipe(hPipe, ByVal 0)
    
    If (res = 0) Then
        res = Err.LastDllError 'GetLastError()
        If res <> ERROR_PIPE_CONNECTED Then GoTo totalfail5
    End If
    
    i = 0
    l = Len(stdin)
    If l = 0 Then
        GoTo skipwrite ' wtf???
    End If
    
    Do
        dwWritten = 0
        
        toWrite = l - i
        If toWrite > (PIPES_BUFFSIZE - 36) Then toWrite = PIPES_BUFFSIZE - 36
        
        res = WriteFile(hPipe, Mid(stdin, i + 1, toWrite), toWrite, dwWritten, ByVal 0&)
        i = i + dwWritten
    Loop Until (i >= l) Or (res <> 1)


    DoEvents

' maybe have to read error buffer before FlushFileBuffers or we stall

        lngSizeOf = GetFileSize(ehRead, 0&)
        If (lngSizeOf > 0) Then
            ReDim abytBuff(lngSizeOf - 1)
            
            If ReadFile(ehRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then
                 stderr = stderr & StrConv(abytBuff, vbUnicode)
            End If
        End If
        
        lngSizeOf = GetFileSize(ohRead, 0&)
        If (lngSizeOf > 0) Then
            ReDim abytBuff(lngSizeOf - 1)
            
            If ReadFile(ohRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then
                stdout = stdout & StrConv(abytBuff, vbUnicode)
            End If
        End If


    DoEvents

    
skipwrite:

    res = CloseHandle(hPipe)
   
    loops = 0
   
    Do
        res = WaitForSingleObject(tSA_CreateProcessPrcInfo.hProcess, 50)
        
        DoEvents
                
        lngSizeOf = GetFileSize(ehRead, 0&)
        If (lngSizeOf > 0) Then
            ReDim abytBuff(lngSizeOf - 1)
            
            If ReadFile(ehRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then
                 stderr = stderr & StrConv(abytBuff, vbUnicode)
            End If
        End If
        
        lngSizeOf = GetFileSize(ohRead, 0&)
        If (lngSizeOf > 0) Then
            ReDim abytBuff(lngSizeOf - 1)
            
            If ReadFile(ohRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then
                stdout = stdout & StrConv(abytBuff, vbUnicode)
            End If
        End If

        loops = loops + 1
               
    Loop Until (res <> 258) Or (loops > 100)
    
    If loops > 100 Then
    
        Debug.Print "ExecCmdPipe: HUNG_PROCESS thread: " & tSA_CreateProcessPrcInfo.hThread & vbLf & " handle: " & tSA_CreateProcessPrcInfo.hProcess & vbLf & " cmd: " & cmdline
    
    End If
    
    
    res = CloseHandle(tSA_CreateProcessPrcInfo.hThread)
    res = CloseHandle(tSA_CreateProcessPrcInfo.hProcess)
    
    res = CloseHandle(ohWrite)
    res = CloseHandle(ohRead)
    
    res = CloseHandle(ehWrite)
    res = CloseHandle(ehRead)
           
    res = CloseHandle(hPipe)
    res = GlobalFree(pSD)
   
    GoTo success
    
totalfail6:
    
    res = DisconnectNamedPipe(hPipe)

totalfail5:

    res = CloseHandle(hPipe)

totalfail4:

    res = GlobalFree(pSD)
   
totalfail3:
    
    res = CloseHandle(ehWrite)
    res = CloseHandle(ehRead)
   
totalfail2:
   
    res = CloseHandle(ohWrite)
    res = CloseHandle(ohRead)
   
totalfail1:

    ExecCmdPipe = False
   
    Beep
    
    GoTo theend

success:
    ExecCmdPipe = True
    
theend:

End Function

Here’s some of the win32 library stuff you’ll need. This is for a general library of win32 calls, more than I need for this pipe code.

Option Compare Text    ' So string comparisons aren't case-sensitive
Option Explicit


 
Private Const szPipeName = "\\.\pipe\htmlclean"
Private Const PIPES_BUFFSIZE = 65536
'Private BigBuffer(BUFFSIZE) As Byte
 
Public Const FILE_ATTRIBUTE_NORMAL = &H80
Public Const FILE_FLAG_NO_BUFFERING = &H20000000
Public Const FILE_FLAG_WRITE_THROUGH = &H80000000
Public Const FILE_FLAG_FIRST_PIPE_INSTANCE = &H80000
Public Const ERROR_INVALID_HANDLE = &H6
Public Const ERROR_BROKEN_PIPE = 109
Public Const ERROR_IO_PENDING = 997
Public Const ERROR_PIPE_BUSY = 231
Public Const ERROR_NO_DATA = 232
Public Const ERROR_PIPE_NOT_CONNECTED = 233
Public Const ERROR_PIPE_CONNECTED = 535
Public Const ERROR_PIPE_LISTENING = 536

Public Const PIPE_ACCESS_OUTBOUND = &H2
Public Const PIPE_ACCESS_DUPLEX = &H3
Public Const PIPE_READMODE_MESSAGE = &H2
Public Const PIPE_READMODE_BYTE = &H0
Public Const PIPE_TYPE_MESSAGE = &H4
Public Const PIPE_WAIT = &H0

Public Const HANDLE_FLAG_INHERIT = &H1
Public Const HANDLE_FLAG_PROTECT_FROM_CLOSE = &H2

Public Const INVALID_HANDLE_VALUE = -1

Public Const SECURITY_DESCRIPTOR_MIN_LENGTH = (20)
Public Const SECURITY_DESCRIPTOR_REVISION = (1)
 
    
Private Const WAIT_INFINITE         As Long = (-1&)
'Private Const STARTF_USESHOWWINDOW  As Long = &H1
Private Const STARTF_USESTDHANDLES  As Long = &H100
Private Const SW_HIDE               As Long = 0&

Private Const STARTF_USESHOWWINDOW& = &H1
Private Const NORMAL_PRIORITY_CLASS = &H20&
Private Const INFINITE = -1&

Public Const GMEM_FIXED = &H0
Public Const GMEM_ZEROINIT = &H40
Public Const GPTR = (GMEM_FIXED Or GMEM_ZEROINIT)

Private Type STARTUPINFO
   cb As Long
   lpReserved As String
   lpDesktop As String
   lpTitle As String
   dwX As Long
   dwY As Long
   dwXSize As Long
   dwYSize As Long
   dwXCountChars As Long
   dwYCountChars As Long
   dwFillAttribute As Long
   dwFlags As Long
   wShowWindow As Integer
   cbReserved2 As Integer
   lpReserved2 As Long
   hStdInput As Long
   hStdOutput As Long
   hStdError As Long
End Type

Private Type PROCESS_INFORMATION
   hProcess As Long
   hThread As Long
   dwProcessId As Long
   dwThreadId As Long
End Type

Private Type SECURITY_ATTRIBUTES
    nLength As Long
    lpSecurityDescriptor As Long
    bInheritHandle As Long
End Type
    
Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
Private Declare Function GetLastError Lib "kernel32.dll" () As Long
Private Declare Function WaitForSingleObject Lib "kernel32" (ByVal hHandle As Long, ByVal dwMilliseconds As Long) As Long

Private Declare Function CreateProcessA Lib "kernel32" (ByVal _
   lpApplicationName As Long, ByVal lpCommandLine As String, ByVal _
   lpProcessAttributes As Long, ByVal lpThreadAttributes As Long, _
   ByVal bInheritHandles As Long, ByVal dwCreationFlags As Long, _
   ByVal lpEnvironment As Long, ByVal lpCurrentDirectory As Long, _
   lpStartupInfo As STARTUPINFO, lpProcessInformation As _
   PROCESS_INFORMATION) As Long

Private Declare Function SetHandleInformation Lib "kernel32.dll" ( _
                 ByVal hObject As Long, _
                 ByVal dwMask As Long, _
                 ByVal dwFlags As Long) As Long
                 
Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long

Private Declare Function CreatePipe Lib "kernel32" (phReadPipe As Long, phWritePipe As Long, lpPipeAttributes As SECURITY_ATTRIBUTES, ByVal nSize As Long) As Long
Private Declare Function ReadFile Lib "kernel32" (ByVal hFile As Long, lpBuffer As Any, ByVal nNumberOfBytesToRead As Long, lpNumberOfBytesRead As Long, lpOverlapped As Any) As Long
Private Declare Function WriteFile Lib "kernel32" (ByVal hFile As Long, ByVal lpBuffer As String, ByVal nNumberOfBytesToWrite As Long, lpNumberOfBytesWritten As Long, lpOverlapped As Any) As Long

Private Declare Function FlushFileBuffers Lib "kernel32" (ByVal hFile As Long) As Long

Private Declare Function GetExitCodeProcess Lib "kernel32" (ByVal hProcess As Long, lpExitCode As Long) As Long
Private Declare Sub GetStartupInfo Lib "kernel32" Alias "GetStartupInfoA" (lpStartupInfo As STARTUPINFO)
Private Declare Function GetFileSize Lib "kernel32" (ByVal hFile As Long, lpFileSizeHigh As Long) As Long

Private Declare Function GlobalAlloc Lib "kernel32" ( _
   ByVal wFlags As Long, ByVal dwBytes As Long) As Long

Private Declare Function GlobalFree Lib "kernel32" (ByVal hMem As Long) As Long

Private Declare Function CreateNamedPipe Lib "kernel32" Alias _
   "CreateNamedPipeA" ( _
   ByVal lpName As String, _
   ByVal dwOpenMode As Long, _
   ByVal dwPipeMode As Long, _
   ByVal nMaxInstances As Long, _
   ByVal nOutBufferSize As Long, _
   ByVal nInBufferSize As Long, _
   ByVal nDefaultTimeOut As Long, _
   lpSecurityAttributes As Any) As Long

Private Declare Function InitializeSecurityDescriptor Lib "advapi32.dll" ( _
   ByVal pSecurityDescriptor As Long, _
   ByVal dwRevision As Long) As Long

Private Declare Function SetSecurityDescriptorDacl Lib "advapi32.dll" ( _
   ByVal pSecurityDescriptor As Long, _
   ByVal bDaclPresent As Long, _
   ByVal pDacl As Long, _
   ByVal bDaclDefaulted As Long) As Long

Private Declare Function ConnectNamedPipe Lib "kernel32" ( _
   ByVal hNamedPipe As Long, _
   lpOverlapped As Any) As Long

Private Declare Function DisconnectNamedPipe Lib "kernel32" ( _
   ByVal hNamedPipe As Long) As Long

   
  1. No comments yet.