htmlcleaner using named pipes with microsoft access
I use htmlcleaner to scrape html and then clean bits and pieces before I do some intense custom parsing. I used to do it the easy way with htmlcleaner reading and writing temporary files but it was annoyingly slow. Eventually I broke down and figured out how to run this java code with redirected standard input, output, and error streams.
Microsoft provides some help but as usual they never go all the way and give you something really useful.
support.microsoft.com/kb/177696
support.microsoft.com/kb/178116
First I wrote a wrapper to make my html cleaning dead simple.
Here’s a simple wrapper to call htmlcleaner with my preferred options:
Private Const szPipeName = "\\.\pipe\htmlclean" Private Const PIPES_BUFFSIZE = 65536 Public Const JAVA_EXE As String = "C:\Utils\Java\jre8\bin\java.exe" Public Const CLEANER_EXE As String = "C:\Utils\htmlcleaner-2.9.r264s.jar" Public Function HtmlCleaner(ByVal stdin As String, ByRef stdout As String, ByRef stderr As String) As Boolean Dim cmdline As String Dim cmd_options As String Dim WindowStyle As Long cmd_options = "outputtype=htmlcompact advancedxmlescape=false specialentities=false unicodechars=false omitcomments=true omitxmldecl=true omitdoctypedecl=true omithtmlenvelope=true useemptyelementtags=true" cmdline = """" & JAVA_EXE & """ -jar """ & CLEANER_EXE & """ src=""" & szPipeName & """ incharset=iso-8859-1 outcharset=UTF-8 " & cmd_options HtmlCleaner = ExecCmdPipe(cmdline, stdin, stdout, stderr) totalfail1: 'Beep success: 'ok End Function
Now here’s the real code. I’m not claiming any authorship. I stole this from the net anywhere that had useful bits and pieces then hacked at it until it worked. It was such a huge pain that I thought I should share and maybe save someone else a lot of trouble.
I had a terrible time with blocking pipes but refused to give up until it worked. This is code that I am actually using and it works. It’s not some half-assed sample code. I call this function about 4500 times or more in a single run. Leaking handles and memory is an issue so I try to clean up well and handle errors in a half decent manner.
More importantly it’s fast, a lot faster than reading and writing temporary files, and I get the error output which I later database so I can check for problems.
It’s obviously just the function not the whole vb module so you can’t just copy and paste this alone. This code calls other things like CreateNamedPipe and CreateProcessA. These calls are all standard library stuff and you can find the supporting vb module code pretty much anywhere, if not already in the microsoft links mentioned above. I’ve included my win32 library stuff at the bottom of this post. I expect this is everything you’d need to run this code.
Public Function ExecCmdPipe(ByVal cmdline As String, ByVal stdin As String, ByRef stdout As String, ByRef stderr As String, Optional WindowStyle As Long = vbHide) As Boolean Dim tSA_CreatePipe As SECURITY_ATTRIBUTES Dim tSA_CreateProcessPrc As SECURITY_ATTRIBUTES Dim tSA_CreateProcessThrd As SECURITY_ATTRIBUTES Dim tSA_CreateProcessPrcInfo As PROCESS_INFORMATION Dim tStartupInfo As STARTUPINFO Dim bRead As Long Dim abytBuff() As Byte Dim lngResult As Long Dim szFullCommand As String Dim lngExitCode As Long Dim lngSizeOf As Long Dim ohRead As Long Dim ohWrite As Long Dim ehRead As Long Dim ehWrite As Long Dim pSD As Long Dim sa As SECURITY_ATTRIBUTES Dim hPipe As Long Dim dwWritten As Long Dim toWrite As Long Dim loops As Long Dim i As Long, l As Long, dwOpenMode As Long, dwPipeMode As Long Dim res As Long, nCount As Long, cbnCount As Long dwWritten = 0 stdout = "" stderr = "" ExecCmdPipe = False res = 0 tSA_CreatePipe.nLength = Len(tSA_CreatePipe) tSA_CreatePipe.lpSecurityDescriptor = 0& tSA_CreatePipe.bInheritHandle = True tSA_CreateProcessPrc.nLength = Len(tSA_CreateProcessPrc) tSA_CreateProcessThrd.nLength = Len(tSA_CreateProcessThrd) If (CreatePipe(ohRead, ohWrite, tSA_CreatePipe, 0&) = 0&) Then GoTo totalfail1 ' Ensure the read handle to the pipe for STDOUT is not inherited. If (SetHandleInformation(ohRead, HANDLE_FLAG_INHERIT, 0) = 0) Then GoTo totalfail2 If (CreatePipe(ehRead, ehWrite, tSA_CreatePipe, 0&) = 0&) Then GoTo totalfail2 ' Ensure the read handle to the pipe for STDERR is not inherited. If (SetHandleInformation(ehRead, HANDLE_FLAG_INHERIT, 0) = 0) Then GoTo totalfail3 tStartupInfo.cb = Len(tStartupInfo) GetStartupInfo tStartupInfo With tStartupInfo .cb = Len(tStartupInfo) ' Initialize the STARTUPINFO structure: .hStdOutput = ohWrite .hStdError = ehWrite .dwFlags = STARTF_USESTDHANDLES If Not IsMissing(WindowStyle) Then .wShowWindow = WindowStyle Else .wShowWindow = SW_HIDE End If .dwFlags = .dwFlags Or STARTF_USESHOWWINDOW End With 'Create the NULL security token for the pipe pSD = GlobalAlloc(GPTR, SECURITY_DESCRIPTOR_MIN_LENGTH) res = InitializeSecurityDescriptor(pSD, SECURITY_DESCRIPTOR_REVISION) res = SetSecurityDescriptorDacl(pSD, -1, 0, 0) sa.nLength = LenB(sa) sa.lpSecurityDescriptor = pSD sa.bInheritHandle = False 'Create the Named Pipe dwOpenMode = PIPE_ACCESS_OUTBOUND Or FILE_FLAG_WRITE_THROUGH Or FILE_FLAG_FIRST_PIPE_INSTANCE dwPipeMode = PIPE_WAIT ' Or PIPE_TYPE_MESSAGE Or PIPE_READMODE_MESSAGE hPipe = CreateNamedPipe(szPipeName, dwOpenMode, dwPipeMode, 2, PIPES_BUFFSIZE, PIPES_BUFFSIZE, 0, sa) If (hPipe = 0) Then GoTo totalfail4 DoEvents ' Start the shelled application: res = CreateProcessA(0&, cmdline, 0&, 0&, 1&, NORMAL_PRIORITY_CLASS, 0&, 0&, tStartupInfo, tSA_CreateProcessPrcInfo) If (res = 0) Then GoTo totalfail5 ' Wait for the app to connect to the named pipe res = ConnectNamedPipe(hPipe, ByVal 0) If (res = 0) Then res = Err.LastDllError 'GetLastError() If res <> ERROR_PIPE_CONNECTED Then GoTo totalfail5 End If i = 0 l = Len(stdin) If l = 0 Then GoTo skipwrite ' wtf??? End If Do dwWritten = 0 toWrite = l - i If toWrite > (PIPES_BUFFSIZE - 36) Then toWrite = PIPES_BUFFSIZE - 36 res = WriteFile(hPipe, Mid(stdin, i + 1, toWrite), toWrite, dwWritten, ByVal 0&) i = i + dwWritten Loop Until (i >= l) Or (res <> 1) DoEvents ' maybe have to read error buffer before FlushFileBuffers or we stall lngSizeOf = GetFileSize(ehRead, 0&) If (lngSizeOf > 0) Then ReDim abytBuff(lngSizeOf - 1) If ReadFile(ehRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then stderr = stderr & StrConv(abytBuff, vbUnicode) End If End If lngSizeOf = GetFileSize(ohRead, 0&) If (lngSizeOf > 0) Then ReDim abytBuff(lngSizeOf - 1) If ReadFile(ohRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then stdout = stdout & StrConv(abytBuff, vbUnicode) End If End If DoEvents skipwrite: res = CloseHandle(hPipe) loops = 0 Do res = WaitForSingleObject(tSA_CreateProcessPrcInfo.hProcess, 50) DoEvents lngSizeOf = GetFileSize(ehRead, 0&) If (lngSizeOf > 0) Then ReDim abytBuff(lngSizeOf - 1) If ReadFile(ehRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then stderr = stderr & StrConv(abytBuff, vbUnicode) End If End If lngSizeOf = GetFileSize(ohRead, 0&) If (lngSizeOf > 0) Then ReDim abytBuff(lngSizeOf - 1) If ReadFile(ohRead, abytBuff(0), UBound(abytBuff) + 1, bRead, ByVal 0&) Then stdout = stdout & StrConv(abytBuff, vbUnicode) End If End If loops = loops + 1 Loop Until (res <> 258) Or (loops > 100) If loops > 100 Then Debug.Print "ExecCmdPipe: HUNG_PROCESS thread: " & tSA_CreateProcessPrcInfo.hThread & vbLf & " handle: " & tSA_CreateProcessPrcInfo.hProcess & vbLf & " cmd: " & cmdline End If res = CloseHandle(tSA_CreateProcessPrcInfo.hThread) res = CloseHandle(tSA_CreateProcessPrcInfo.hProcess) res = CloseHandle(ohWrite) res = CloseHandle(ohRead) res = CloseHandle(ehWrite) res = CloseHandle(ehRead) res = CloseHandle(hPipe) res = GlobalFree(pSD) GoTo success totalfail6: res = DisconnectNamedPipe(hPipe) totalfail5: res = CloseHandle(hPipe) totalfail4: res = GlobalFree(pSD) totalfail3: res = CloseHandle(ehWrite) res = CloseHandle(ehRead) totalfail2: res = CloseHandle(ohWrite) res = CloseHandle(ohRead) totalfail1: ExecCmdPipe = False Beep GoTo theend success: ExecCmdPipe = True theend: End Function
Here’s some of the win32 library stuff you’ll need. This is for a general library of win32 calls, more than I need for this pipe code.
Option Compare Text ' So string comparisons aren't case-sensitive Option Explicit Private Const szPipeName = "\\.\pipe\htmlclean" Private Const PIPES_BUFFSIZE = 65536 'Private BigBuffer(BUFFSIZE) As Byte Public Const FILE_ATTRIBUTE_NORMAL = &H80 Public Const FILE_FLAG_NO_BUFFERING = &H20000000 Public Const FILE_FLAG_WRITE_THROUGH = &H80000000 Public Const FILE_FLAG_FIRST_PIPE_INSTANCE = &H80000 Public Const ERROR_INVALID_HANDLE = &H6 Public Const ERROR_BROKEN_PIPE = 109 Public Const ERROR_IO_PENDING = 997 Public Const ERROR_PIPE_BUSY = 231 Public Const ERROR_NO_DATA = 232 Public Const ERROR_PIPE_NOT_CONNECTED = 233 Public Const ERROR_PIPE_CONNECTED = 535 Public Const ERROR_PIPE_LISTENING = 536 Public Const PIPE_ACCESS_OUTBOUND = &H2 Public Const PIPE_ACCESS_DUPLEX = &H3 Public Const PIPE_READMODE_MESSAGE = &H2 Public Const PIPE_READMODE_BYTE = &H0 Public Const PIPE_TYPE_MESSAGE = &H4 Public Const PIPE_WAIT = &H0 Public Const HANDLE_FLAG_INHERIT = &H1 Public Const HANDLE_FLAG_PROTECT_FROM_CLOSE = &H2 Public Const INVALID_HANDLE_VALUE = -1 Public Const SECURITY_DESCRIPTOR_MIN_LENGTH = (20) Public Const SECURITY_DESCRIPTOR_REVISION = (1) Private Const WAIT_INFINITE As Long = (-1&) 'Private Const STARTF_USESHOWWINDOW As Long = &H1 Private Const STARTF_USESTDHANDLES As Long = &H100 Private Const SW_HIDE As Long = 0& Private Const STARTF_USESHOWWINDOW& = &H1 Private Const NORMAL_PRIORITY_CLASS = &H20& Private Const INFINITE = -1& Public Const GMEM_FIXED = &H0 Public Const GMEM_ZEROINIT = &H40 Public Const GPTR = (GMEM_FIXED Or GMEM_ZEROINIT) Private Type STARTUPINFO cb As Long lpReserved As String lpDesktop As String lpTitle As String dwX As Long dwY As Long dwXSize As Long dwYSize As Long dwXCountChars As Long dwYCountChars As Long dwFillAttribute As Long dwFlags As Long wShowWindow As Integer cbReserved2 As Integer lpReserved2 As Long hStdInput As Long hStdOutput As Long hStdError As Long End Type Private Type PROCESS_INFORMATION hProcess As Long hThread As Long dwProcessId As Long dwThreadId As Long End Type Private Type SECURITY_ATTRIBUTES nLength As Long lpSecurityDescriptor As Long bInheritHandle As Long End Type Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long) Private Declare Function GetLastError Lib "kernel32.dll" () As Long Private Declare Function WaitForSingleObject Lib "kernel32" (ByVal hHandle As Long, ByVal dwMilliseconds As Long) As Long Private Declare Function CreateProcessA Lib "kernel32" (ByVal _ lpApplicationName As Long, ByVal lpCommandLine As String, ByVal _ lpProcessAttributes As Long, ByVal lpThreadAttributes As Long, _ ByVal bInheritHandles As Long, ByVal dwCreationFlags As Long, _ ByVal lpEnvironment As Long, ByVal lpCurrentDirectory As Long, _ lpStartupInfo As STARTUPINFO, lpProcessInformation As _ PROCESS_INFORMATION) As Long Private Declare Function SetHandleInformation Lib "kernel32.dll" ( _ ByVal hObject As Long, _ ByVal dwMask As Long, _ ByVal dwFlags As Long) As Long Private Declare Function CloseHandle Lib "kernel32" (ByVal hObject As Long) As Long Private Declare Function CreatePipe Lib "kernel32" (phReadPipe As Long, phWritePipe As Long, lpPipeAttributes As SECURITY_ATTRIBUTES, ByVal nSize As Long) As Long Private Declare Function ReadFile Lib "kernel32" (ByVal hFile As Long, lpBuffer As Any, ByVal nNumberOfBytesToRead As Long, lpNumberOfBytesRead As Long, lpOverlapped As Any) As Long Private Declare Function WriteFile Lib "kernel32" (ByVal hFile As Long, ByVal lpBuffer As String, ByVal nNumberOfBytesToWrite As Long, lpNumberOfBytesWritten As Long, lpOverlapped As Any) As Long Private Declare Function FlushFileBuffers Lib "kernel32" (ByVal hFile As Long) As Long Private Declare Function GetExitCodeProcess Lib "kernel32" (ByVal hProcess As Long, lpExitCode As Long) As Long Private Declare Sub GetStartupInfo Lib "kernel32" Alias "GetStartupInfoA" (lpStartupInfo As STARTUPINFO) Private Declare Function GetFileSize Lib "kernel32" (ByVal hFile As Long, lpFileSizeHigh As Long) As Long Private Declare Function GlobalAlloc Lib "kernel32" ( _ ByVal wFlags As Long, ByVal dwBytes As Long) As Long Private Declare Function GlobalFree Lib "kernel32" (ByVal hMem As Long) As Long Private Declare Function CreateNamedPipe Lib "kernel32" Alias _ "CreateNamedPipeA" ( _ ByVal lpName As String, _ ByVal dwOpenMode As Long, _ ByVal dwPipeMode As Long, _ ByVal nMaxInstances As Long, _ ByVal nOutBufferSize As Long, _ ByVal nInBufferSize As Long, _ ByVal nDefaultTimeOut As Long, _ lpSecurityAttributes As Any) As Long Private Declare Function InitializeSecurityDescriptor Lib "advapi32.dll" ( _ ByVal pSecurityDescriptor As Long, _ ByVal dwRevision As Long) As Long Private Declare Function SetSecurityDescriptorDacl Lib "advapi32.dll" ( _ ByVal pSecurityDescriptor As Long, _ ByVal bDaclPresent As Long, _ ByVal pDacl As Long, _ ByVal bDaclDefaulted As Long) As Long Private Declare Function ConnectNamedPipe Lib "kernel32" ( _ ByVal hNamedPipe As Long, _ lpOverlapped As Any) As Long Private Declare Function DisconnectNamedPipe Lib "kernel32" ( _ ByVal hNamedPipe As Long) As Long
Recent Comments